Thinning, Entropy and the Law of Thin Numbers

Peter Harremoës, Oliver Johnson and Ioannis Kontoyiannis



Abstract 

Rényi's thinning operation on a discrete random variable is a natural discrete analog of the scaling operation
for continuous random variables. The properties of thinning are investigated in an information-theoretic context, 
especially in connection with information-theoretic inequalities related to Poisson approximation results. The 
classical Binomial-to-Poisson convergence (sometimes referred to as the "law of small numbers") is seen to be a 
special case of a thinning limit theorem for convolutions of discrete distributions. A rate of convergence is provided 
for this limit, and nonasymptotic bounds are also established. This development parallels, in part, the development 
of Gaussian inequalities leading to the information-theoretic version of the central limit theorem. In particular, a 
"thinning Markov chain" is introduced, and it is shown to play a role analogous to that of the Ornstein-Uhlenbeck 
process in connection to the entropy power inequality.

I. Introduction

Approximating the distribution of a sum of weakly dependent discrete random variables by a Poisson 
distribution is an important and well-studied problem in probability; see ^ and the references therein 
for an extensive account. Strong connections between these results and information-theoretic techniques 
were established in [15], [24]. In particular, for the special case of approximating a binomial distribution by
a Poisson, some of the sharpest results to date are established using a combination of the techniques 
lfT5lll24l and Pinsker's inequality fTll ifTOl ifTSl . Earlier work on information-theoretic bounds for Poisson 
approximation is reported in ll36l[[ 2T |ll28l . 

The thinning operation, which we define next, was introduced by Rényi in [29], who used it to provide
an alternative characterization of Poisson measures.

Definition 1: Given $\alpha \in [0,1]$ and a discrete random variable $X$ with distribution $P$ on $\mathbb{N}_0 = \{0, 1, \ldots\}$,
the $\alpha$-thinning of $P$ is the distribution $T_\alpha(P)$ of the sum,

$$\sum_{x=1}^{X} B_x, \quad \text{where } B_1, B_2, \ldots \sim \text{i.i.d. } \mathrm{Bern}(\alpha), \tag{1}$$

where the random variables $\{B_x\}$ are independent and identically distributed (i.i.d.), each with a Bernoulli
distribution with parameter $\alpha$, denoted $\mathrm{Bern}(\alpha)$, and also independent of $X$. [As usual, we take the empty
sum $\sum_{x=1}^{0}(\cdots)$ to be equal to zero.] An explicit representation of $T_\alpha(P)$ can be given as,

$$T_\alpha(P)(z) = \sum_{x=z}^{\infty} P(x)\binom{x}{z}\alpha^z(1-\alpha)^{x-z}, \quad z \ge 0. \tag{2}$$

When it causes no ambiguity, the thinned distribution $T_\alpha(P)$ is written simply $T_\alpha P$.
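As a small numerical illustration of the explicit representation (2), the following Python sketch computes the thinned mass function of a distribution with finite support. The function name `thin` and the list-based representation of $P$ are assumptions made for this example only; they do not appear in the paper.

```python
from math import comb

def thin(pmf, alpha):
    """Return the pmf of the alpha-thinning T_alpha(P), using formula (2).

    pmf[x] = P(x) for x = 0, ..., len(pmf) - 1; finite support is an
    assumption of this sketch.
    """
    n = len(pmf)
    return [
        sum(pmf[x] * comb(x, z) * alpha**z * (1 - alpha)**(x - z)
            for x in range(z, n))
        for z in range(n)
    ]

# Hypothetical example: P uniform on {0, 1, 2}, thinned with alpha = 1/2.
thinned = thin([1 / 3, 1 / 3, 1 / 3], 0.5)
total_mass = sum(thinned)
thinned_mean = sum(z * p for z, p in enumerate(thinned))
```

In this example the thinned distribution is again a probability mass function, and its mean is $\alpha$ times the mean of $P$, as expected from the definition (1).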

For any random variable $X$ with distribution $P$ on $\mathbb{N}_0$, we write $P^{*n}$ for the $n$-fold convolution of $P$
with itself, i.e., the distribution of the sum of $n$ i.i.d. copies of $X$. For example, if $P \sim \mathrm{Bern}(p)$, then
$P^{*n} \sim \mathrm{Bin}(n,p)$, the binomial distribution with parameters $n$ and $p$. It is easy to see that its $(1/n)$-thinning,
$T_{1/n}(P^{*n})$, is simply $\mathrm{Bin}(n,p/n)$; see Example 6 below. Therefore, the classical Binomial-to-Poisson
convergence result, sometimes referred to as the "law of small numbers," can be phrased as
saying that, if $P \sim \mathrm{Bern}(p)$, then,

$$T_{1/n}(P^{*n}) \to \mathrm{Po}(p), \quad \text{as } n \to \infty, \tag{3}$$

where $\mathrm{Po}(\lambda)$ denotes the Poisson distribution with parameter $\lambda > 0$.

One of the main points of this work is to show that this result holds for a very wide class of distributions
$P$, and to provide conditions under which several stronger and more general versions of (3) can be
obtained. We refer to results of the form (3) as laws of thin numbers.
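The convergence in (3) can be observed numerically for a distribution $P$ that is not Bernoulli. The sketch below, under the assumption of a finitely supported $P$, thins the $n$-fold convolution by $1/n$ and measures its total variation distance (half the $\ell_1$ distance) to the Poisson law with the same mean. All function names and the choice of $P$ are hypothetical.

```python
from math import comb, exp, factorial

def thin(pmf, alpha):
    """pmf of T_alpha(P) for a finitely supported P, via formula (2)."""
    n = len(pmf)
    return [sum(pmf[x] * comb(x, z) * alpha**z * (1 - alpha)**(x - z)
                for x in range(z, n)) for z in range(n)]

def conv_power(pmf, n):
    """pmf of the n-fold convolution P^{*n}."""
    out = [1.0]  # point mass at zero
    for _ in range(n):
        new = [0.0] * (len(out) + len(pmf) - 1)
        for i, a in enumerate(out):
            for j, b in enumerate(pmf):
                new[i + j] += a * b
        out = new
    return out

def tv_to_poisson(pmf, lam):
    """Half the l1 distance between pmf and Po(lam)."""
    po = [exp(-lam) * lam**z / factorial(z) for z in range(len(pmf))]
    tail = max(0.0, 1.0 - sum(po))  # Poisson mass outside the support of pmf
    return 0.5 * (sum(abs(a - b) for a, b in zip(pmf, po)) + tail)

P = [0.6, 0.0, 0.4]  # hypothetical example: values 0 and 2, mean 0.8
lam = 0.8
distances = [tv_to_poisson(thin(conv_power(P, n), 1 / n), lam)
             for n in (1, 4, 16, 64)]
```

For this choice of $P$ the computed distances shrink as $n$ grows, in line with the law of thin numbers.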

Section II contains numerous examples that illustrate how particular families of random variables behave
on thinning, and it also introduces some of the particular classes of random variables that will be considered
in the rest of the paper. In Sections III and IV several versions of the law of thin numbers are formulated;
first for i.i.d. random variables in Section III and then for general classes of (not necessarily independent or
identically distributed) random variables in Section IV. For example, in the simplest case where $Y_1, Y_2, \ldots$
are i.i.d. with distribution $P$ on $\mathbb{N}_0$ and with mean $\lambda$, so that the distribution of their sum,
$S_n = Y_1 + Y_2 + \cdots + Y_n$, is $P^{*n}$, Theorem 14 shows that,

$$D(T_{1/n}(P^{*n}) \,\|\, \mathrm{Po}(\lambda)) \to 0, \quad \text{as } n \to \infty, \tag{4}$$

as long as $D(P\,\|\,\mathrm{Po}(\lambda)) < \infty$, where, as usual, $D(P\,\|\,Q)$ denotes the information divergence, or relative
entropy, from $P$ to $Q$.$^1$

Note that, unlike most classical Poisson convergence results, the law of thin numbers in (4) proves a
Poisson limit theorem for the sum of a single sequence of random variables, rather than for a triangular
array.

It may be illuminating to compare the result (4) with the information-theoretic version of the central limit
theorem (CLT); see, e.g., [2], [19]. Suppose $Y_1, Y_2, \ldots$ are i.i.d. continuous random variables with density $f$
on $\mathbb{R}$, and with zero mean and unit variance. Then the density of their sum $S_n = Y_1 + Y_2 + \cdots + Y_n$ is the
$n$-fold convolution $f^{*n}$ of $f$ with itself. Write $\Sigma_\alpha$ for the standard scaling operation in the CLT regime: If
a continuous random variable $X$ has density $f$, then $\Sigma_\alpha(f)$ is the density of the scaled random variable
$\sqrt{\alpha}X$, and, in particular, the density of the standardized sum $\frac{1}{\sqrt{n}}S_n$ is $\Sigma_{1/n}(f^{*n})$. The information-theoretic
CLT states that, if $D(f\,\|\,\phi) < \infty$, we have,

$$D(\Sigma_{1/n}(f^{*n}) \,\|\, \phi) \to 0, \quad \text{as } n \to \infty, \tag{5}$$

where $\phi$ is the standard Normal density. Note the close analogy between the statements of the law of thin
numbers in (4) and the CLT in (5).

Before describing the rest of our results, we mention that there is a significant thread in the literature
on thinning limit theorems and associated results for point processes. Convergence theorems of the "law
of thin numbers" type, as in (3) and (4), were first examined in the context of queueing theory by
Palm [27] and Khinchin [22], while more general results were established by Grigelionis [14]. See the
discussion in the text [9, pp. 146-166] for details and historical remarks; also see the comments following
Theorem 16 in Section IV. More specifically, this line of work considered asymptotic results, primarily in
the sense of weak convergence, for the distribution of a superposition of the sample paths of independent
(or appropriately weakly dependent) point processes. Here we take a different direction and, instead of
considering the full infinite-dimensional distribution of a point process, we focus on finer results, e.g.,
convergence in information divergence and nonasymptotic bounds, for the one-dimensional distribution
of the thinned sum of integer-valued random variables.

With these goals in mind, before examining the finite-$n$ behavior of $T_{1/n}(P^{*n})$, in Section V we study
a simpler but related problem, on the convergence of a continuous-time "thinning" Markov chain on $\mathbb{N}_0$.
In the present context, this chain plays a role parallel to that of the Ornstein-Uhlenbeck process in the 
context of Gaussian convergence and the entropy power inequality OTlllSlllllSl . We show that the thinning 
Markov chain has the Poisson law as its unique invariant measure, and we establish its convergence both in 
total variation and in terms of information divergence. Moreover, in Theorem 28 we characterize precisely
the rate at which it converges to the Poisson law in terms of the distance, which also leads to an upper 
bound on its convergence in information divergence. A new characterization of the Poisson distribution 
in terms of thinning is also obtained. The main technical tool used here is based on an examination of 
the properties of the Poisson-Charlier polynomials in the thinning context. 

$^1$Throughout the paper, log denotes the natural logarithm to base $e$, and we adopt the usual convention that $0 \log 0 = 0$.

In Section VI we give both asymptotic and finite-$n$ bounds on the rate of convergence for the law of
thin numbers. Specifically, we employ the scaled Fisher information functional introduced in [24] to give
precise, explicit bounds on the divergence, $D(T_{1/n}(P^{*n}) \,\|\, \mathrm{Po}(\lambda))$. An example of the type of result we
prove is the following: Suppose $X$ is an ultra bounded (see Definition 8 in Section II) random variable,
with distribution $P$, mean $\lambda$, and finite variance $\sigma^2 \neq \lambda$. Then,

$$\limsup_{n\to\infty}\, n^2 D(T_{1/n}(P^{*n}) \,\|\, \mathrm{Po}(\lambda)) \le 2c^2,$$

for a nonzero constant $c$ we explicitly identify; cf. Corollary 32.

Similarly, in Section VII we give both finite-$n$ and asymptotic bounds on the law of small numbers in
terms of the total variation distance, $\|T_{1/n}(P^{*n}) - \mathrm{Po}(\lambda)\|$, between $T_{1/n}(P^{*n})$ and the $\mathrm{Po}(\lambda)$ distribution.
In particular, Theorem 36 states that if $X \sim P$ has mean $\lambda$ and finite variance $\sigma^2$, then, for all $n$,

$$\|T_{1/n}(P^{*n}) - \mathrm{Po}(\lambda)\| \le \frac{1}{n\,2^{1/2}} + \frac{\sigma}{n^{1/2}}\min\Big\{1, \frac{1}{2\lambda^{1/2}}\Big\}.$$

A closer examination of the monotonicity properties of the scaled Fisher information in relation to the
thinning operation is described in Section VIII. Finally, Section IX shows how the idea of thinning can
be extended to compound Poisson distributions. The Appendix contains the proofs of some of the more
technical results.

Finally we mention that, after the announcement of the present results in [11], Yu [35] also obtained
some interesting, related results. In particular, he showed that the conditions of the strong and thermodynamic
versions of the law of thin numbers (see Theorems 14 and 12) can be weakened, and he also
provided conditions under which the convergence in these limit theorems is monotonic in $n$.

II. Examples of Thinning and Distribution Classes

This section contains several examples of the thinning operation, statements of its more basic properties,
and the definitions of some important classes of distributions that will play a central role in the rest
of this work. The proofs of all the lemmas and propositions of this section are given in the Appendix.

Note, first, two important properties of thinning that are immediate from its definition: 

1. The thinning of a sum of independent random variables is the convolution of the corresponding 
thinnings. 

2. For all $\alpha, \beta \in [0,1]$ and any distribution $P$ on $\mathbb{N}_0$, we have,

$$T_\alpha(T_\beta(P)) = T_{\alpha\beta}(P). \tag{6}$$
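Both properties are easy to confirm numerically for finitely supported distributions. The following sketch checks them on two hypothetical mass functions `P` and `Q`; the helper names `thin` and `convolve` are assumptions of this example.

```python
from math import comb

def thin(pmf, alpha):
    """pmf of T_alpha(P) for a finitely supported P, via formula (2)."""
    n = len(pmf)
    return [sum(pmf[x] * comb(x, z) * alpha**z * (1 - alpha)**(x - z)
                for x in range(z, n)) for z in range(n)]

def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

P = [0.2, 0.5, 0.3]
Q = [0.7, 0.1, 0.1, 0.1]
alpha, beta = 0.4, 0.5

# Property 2: T_alpha(T_beta(P)) = T_{alpha * beta}(P).
lhs = thin(thin(P, beta), alpha)
rhs = thin(P, alpha * beta)
semigroup_gap = max(abs(a - b) for a, b in zip(lhs, rhs))

# Property 1: thinning a convolution equals convolving the thinnings.
lhs_sum = thin(convolve(P, Q), alpha)
rhs_sum = convolve(thin(P, alpha), thin(Q, alpha))
convolution_gap = max(abs(a - b) for a, b in zip(lhs_sum, rhs_sum))
```

Both gaps are zero up to floating-point rounding.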

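Example 2: Thinning preserves the Poisson law, in that $T_\alpha(\mathrm{Po}(\lambda)) = \mathrm{Po}(\alpha\lambda)$. This follows from (2),
since,

$$\begin{aligned}
T_\alpha(\mathrm{Po}(\lambda))(z) &= \sum_{x=z}^{\infty} \mathrm{Po}(\lambda,x)\binom{x}{z}\alpha^z(1-\alpha)^{x-z} \\
&= \sum_{x=z}^{\infty} \frac{e^{-\lambda}\lambda^x}{x!}\binom{x}{z}\alpha^z(1-\alpha)^{x-z} \\
&= \frac{e^{-\lambda}(\alpha\lambda)^z}{z!}\sum_{x=z}^{\infty}\frac{(\lambda(1-\alpha))^{x-z}}{(x-z)!} \\
&= \frac{e^{-\lambda}}{z!}(\alpha\lambda)^z e^{\lambda(1-\alpha)} \\
&= \mathrm{Po}(\alpha\lambda,z),
\end{aligned}$$

where $\mathrm{Po}(\lambda,x) = e^{-\lambda}\lambda^x/x!$, $x \ge 0$, denotes the Poisson mass function.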

As it turns out, the factorial moments of a thinned distribution are easier to work with than ordinary
moments. Recall that the $k$th factorial moment of $X$ is $E[X^{\underline{k}}]$, where $x^{\underline{k}}$ denotes the falling factorial,

$$x^{\underline{k}} = x(x-1)\cdots(x-k+1) = \frac{x!}{(x-k)!}.$$

The factorial moments of an $\alpha$-thinning are easy to calculate:

Lemma 3: For any random variable $Y$ with distribution $P$ on $\mathbb{N}_0$ and for $\alpha \in (0,1)$, writing $Y_\alpha$ for a
random variable with distribution $T_\alpha P$:

$$E[Y_\alpha^{\underline{k}}] = \alpha^k E[Y^{\underline{k}}] \quad \text{for all } k. \tag{7}$$

That is, thinning scales factorial moments in the same way as ordinary multiplication scales ordinary
moments.
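The scaling relation (7) can be checked directly on a small example. The sketch below uses a hypothetical four-point distribution and the standard-library function `math.perm` for the falling factorial; the helper names are assumptions of this example.

```python
from math import comb, perm

def thin(pmf, alpha):
    """pmf of T_alpha(P) for a finitely supported P, via formula (2)."""
    n = len(pmf)
    return [sum(pmf[x] * comb(x, z) * alpha**z * (1 - alpha)**(x - z)
                for x in range(z, n)) for z in range(n)]

def factorial_moment(pmf, k):
    """E[X^(k)] with X^(k) = X (X - 1) ... (X - k + 1); perm(x, k) is 0 when k > x."""
    return sum(p * perm(x, k) for x, p in enumerate(pmf))

P = [0.1, 0.2, 0.3, 0.4]
alpha = 0.3
gaps = [abs(factorial_moment(thin(P, alpha), k) - alpha**k * factorial_moment(P, k))
        for k in range(5)]
```

Every entry of `gaps` vanishes up to rounding, including the trivial cases $k = 0$ and $k = 4$ (where both sides are zero for this $P$).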

We will use the following result, which is a multinomial version of Vandermonde's identity and is 
easily proved by induction. The details are omitted. 

Lemma 4: The falling factorial satisfies the multinomial expansion, i.e., for any positive integer $y$, all
integers $x_1, x_2, \ldots, x_y$, and any $k \ge 1$,

$$\Big(\sum_{i=1}^{y} x_i\Big)^{\underline{k}} = \sum_{k_1,k_2,\ldots,k_y \,:\, k_1+k_2+\cdots+k_y=k} \binom{k}{k_1,k_2,\ldots,k_y}\prod_{j=1}^{y} x_j^{\underline{k_j}}.$$

The following is a basic regularity property of the thinning operation.

Proposition 5: For any $\alpha \in (0,1)$, the map $P \mapsto T_\alpha(P)$ is injective.

Example 6: Thinning preserves the class of Bernoulli sums. That is, the thinned version of the distribution
of a finite sum of independent Bernoulli random variables (with possibly different parameters)
is also such a sum. This follows from property 1 stated in the beginning of this section, combined with
the observation that the $\alpha$-thinning of the $\mathrm{Bern}(p)$ distribution is the $\mathrm{Bern}(\alpha p)$ distribution. In particular,
thinning preserves the binomial family: $T_\alpha(\mathrm{Bin}(n,p)) = \mathrm{Bin}(n,\alpha p)$.

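Example 7: Thinning by $\alpha$ transforms a geometric distribution with mean $\lambda$ into a geometric distribution
with mean $\alpha\lambda$. Recalling that the geometric distribution with mean $\lambda$ has point probabilities,

$$\mathrm{Geo}(\lambda,x) = \frac{1}{1+\lambda}\Big(\frac{\lambda}{1+\lambda}\Big)^x, \quad x = 0,1,\ldots,$$

using (2),

$$\begin{aligned}
T_\alpha(\mathrm{Geo}(\lambda))(z) &= \frac{1}{(1+\lambda)\,z!}\Big(\frac{\alpha\lambda}{1+\lambda}\Big)^z \sum_{x=z}^{\infty} x^{\underline{z}}\Big(\frac{\lambda(1-\alpha)}{1+\lambda}\Big)^{x-z} \\
&= \frac{1}{(1+\lambda)\,z!}\Big(\frac{\alpha\lambda}{1+\lambda}\Big)^z z!\,\Big(1-\frac{\lambda(1-\alpha)}{1+\lambda}\Big)^{-z-1} \\
&= \mathrm{Geo}(\alpha\lambda,z).
\end{aligned}$$

The sum of $n$ i.i.d. geometrics has a negative binomial distribution. Thus, in view of this example and
property 1 stated in the beginning of this section, the thinning of a negative binomial distribution is also
negative binomial.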

Partly motivated by these examples, we describe certain classes of random variables (some of which 
are new). These appear as natural technical assumptions in the subsequent development of our results. 
The reader may prefer to skip the remainder of this section and only refer back to the definitions when 
necessary. 

Definition 8: 1) A Bernoulli sum is a distribution that can be obtained from the sum of finitely many
independent Bernoulli random variables with possibly different parameters. The class of Bernoulli
sums with mean $\lambda$ is denoted by $\mathrm{Ber}(\lambda)$ and the union $\bigcup_{\mu\le\lambda}\mathrm{Ber}(\mu)$ is denoted by $\mathrm{Ber}^-(\lambda)$.

2) A distribution $P$ satisfying the inequality

$$\Big(\frac{P(j)}{\mathrm{Po}(\lambda,j)}\Big)^2 \ge \frac{P(j-1)}{\mathrm{Po}(\lambda,j-1)}\cdot\frac{P(j+1)}{\mathrm{Po}(\lambda,j+1)} \tag{8}$$

is said to be ultra log-concave (ULC); cf. [20]. The set of ultra log-concave distributions with mean
$\lambda$ shall be denoted $\mathrm{ULC}(\lambda)$, and we also write $\mathrm{ULC}^-(\lambda)$ for the union $\bigcup_{\mu\le\lambda}\mathrm{ULC}(\mu)$. Note that
(8) is satisfied for a single value of $\lambda > 0$ if and only if it is satisfied for all $\lambda > 0$.

3) The distribution of a random variable $X$ that satisfies $E[X^{\underline{k+1}}] \le \lambda E[X^{\underline{k}}]$ for all $k \ge 0$ will be
said to be ultra bounded (UB) with ratio $\lambda$. The set of ultra bounded distributions with this ratio is
denoted $\mathrm{UB}(\lambda)$.

4) The distribution of a random variable $X$ satisfying $E[X^{\underline{k}}] \le \lambda^k$ for all $k \ge 0$ will be said to
be Poisson bounded (PB) with ratio $\lambda$. The set of Poisson bounded distributions with this ratio is
denoted $\mathrm{PB}(\lambda)$.

5) A random variable will be said to be ULC, UB or PB, if its distribution is ULC, UB or PB,
respectively.

First we mention some simple relationships between these classes. Walkup [33] showed that if
$X \sim P \in \mathrm{ULC}(\lambda)$ and $Y \sim Q \in \mathrm{ULC}(\mu)$ then $X + Y \sim P * Q \in \mathrm{ULC}(\lambda+\mu)$. Hence $\mathrm{Ber}(\lambda) \subset \mathrm{ULC}(\lambda)$.
In [20] it was shown that, if $P \in \mathrm{ULC}(\lambda)$, then $T_\alpha P \in \mathrm{ULC}(\alpha\lambda)$. Clearly, $\mathrm{UB}(\lambda) \subset \mathrm{PB}(\lambda)$. Further,
$P$ is Poisson bounded if and only if the $\alpha$-thinning $T_\alpha P$ is Poisson bounded, for some $\alpha > 0$. The same
holds for ultra boundedness.

Proposition 9: In the notation of Definition 8, $\mathrm{ULC}(\lambda) \subset \mathrm{UB}(\lambda)$. That is, if the distribution of $X$ is
in $\mathrm{ULC}(\lambda)$ then $E[X^{\underline{k+1}}] \le \lambda E[X^{\underline{k}}]$.

The next result states that the PB and UB properties are preserved on summing and thinning. 

Proposition 10:

(a) If $X \sim P \in \mathrm{PB}(\lambda)$ and $Y \sim Q \in \mathrm{PB}(\mu)$ are independent, then $X + Y \sim P * Q \in \mathrm{PB}(\lambda+\mu)$
and $T_\alpha P \in \mathrm{PB}(\alpha\lambda)$.

(b) If $X \sim P \in \mathrm{UB}(\lambda)$ and $Y \sim Q \in \mathrm{UB}(\mu)$ are independent, then $X + Y \sim P * Q \in \mathrm{UB}(\lambda+\mu)$
and $T_\alpha P \in \mathrm{UB}(\alpha\lambda)$.

Formally, the above discussion can be summarized as,

$$\mathrm{Ber}^-(\lambda) \subset \mathrm{ULC}^-(\lambda) \subset \mathrm{UB}(\lambda) \subset \mathrm{PB}(\lambda).$$

Finally, we note that each of these classes of distributions is "thinning-convex," i.e., if $P$ and $Q$ are
elements of a set then $T_\alpha(P) * T_{1-\alpha}(Q)$ is also an element of the same set. In particular, thinning maps
each of these sets into itself, since $T_\alpha(P) = T_\alpha(P) * T_{1-\alpha}(\delta_0)$ where $\delta_0$, the point mass at zero, has
$\delta_0 \in \mathrm{Ber}^-(\lambda)$.

III. Laws of Thin Numbers: The i.i.d. Case

In this section we state and prove three versions of the law of thin numbers, under appropriate conditions; 
recall the relevant discussion in the Introduction. Theorem 11 proves convergence in total variation,
Theorem 12 in entropy, and Theorem 14 in information divergence.

Recall that the total variation distance $\|P - Q\|$ between two probability distributions $P, Q$ on $\mathbb{N}_0$ is,

$$\|P - Q\| := \sup_{B \subset \mathbb{N}_0} |P(B) - Q(B)| = \frac{1}{2}\sum_{x \ge 0} |P(x) - Q(x)|. \tag{9}$$

Theorem 11 (weak version): For any distribution $P$ on $\mathbb{N}_0$ with mean $\lambda$,

$$\|T_{1/n}(P^{*n}) - \mathrm{Po}(\lambda)\| \to 0, \quad n \to \infty.$$

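Proof: In view of Scheffé's lemma, pointwise convergence of discrete distributions is equivalent to
convergence in total variation, so it suffices to show that $T_{1/n}(P^{*n})(z)$ converges to $e^{-\lambda}\lambda^z/z!$, for all
$z \ge 0$.

Note that $T_{1/n}(P^{*n}) = (T_{1/n}(P))^{*n}$, and that (2) implies the following elementary bounds for all $\alpha$,
using Jensen's inequality:

$$T_\alpha(P)(0) = \sum_{x=0}^{\infty} P(x)(1-\alpha)^x \ge (1-\alpha)^{\lambda}, \tag{10}$$

$$T_\alpha(P)(1) = \sum_{x=1}^{\infty} P(x)\,x\,\alpha(1-\alpha)^{x-1}.$$

Since for i.i.d. variables $Y_i$, the probability $\Pr\{Y_1 + \cdots + Y_n = z\} \ge \binom{n}{z}\Pr\{Y_1 = 1\}^z \Pr\{Y_1 = 0\}^{n-z}$,
taking $\alpha = 1/n$ we obtain,

$$(T_{1/n}(P))^{*n}(z) \ge \frac{n^{\underline{z}}}{z!\,n^z}\Big(\sum_{x=1}^{\infty} P(x)\,x\Big(1-\frac{1}{n}\Big)^{x-1}\Big)^z\Big(1-\frac{1}{n}\Big)^{\lambda(n-z)}.$$

Now, for any fixed value of $z$ and $n$ tending to infinity,

$$\frac{n^{\underline{z}}}{z!\,n^z} \to \frac{1}{z!} \quad\text{and}\quad \Big(1-\frac{1}{n}\Big)^{\lambda(n-z)} \to e^{-\lambda},$$

and by monotone convergence,

$$\sum_{x=1}^{\infty} P(x)\,x\Big(1-\frac{1}{n}\Big)^{x-1} \to \lambda.$$

Therefore,

$$\liminf_{n\to\infty}\,(T_{1/n}(P))^{*n}(z) \ge \mathrm{Po}(\lambda,z).$$

Since all $(T_{1/n}(P))^{*n}$ are probability mass functions and so is $\mathrm{Po}(\lambda)$, the above liminf is necessarily a
limit. ■

As usual, the entropy of a probability distribution $P$ on $\mathbb{N}_0$ is defined by,

$$H(P) = -\sum_{k \ge 0} P(k)\log P(k).$$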

k>0 

Theorem 12 (thermodynamic version): For any Poisson bounded distribution $P$ on $\mathbb{N}_0$ with mean $\lambda$,

$$H(T_{1/n}(P^{*n})) \to H(\mathrm{Po}(\lambda)), \quad \text{as } n \to \infty.$$

Proof: The distribution $T_{1/n}(P^{*n})$ converges pointwise to the Poisson distribution so, by dominated
convergence, it is sufficient to prove that $-T_{1/n}(P^{*n})(x)\log\big(T_{1/n}(P^{*n})(x)\big)$ is dominated by a summable
function. This easily follows from the simple bound in the following lemma. ■

Lemma 13: Suppose $P$ is Poisson bounded with ratio $\mu$. Then, $P(x) \le \mathrm{Po}(\mu,x)\,e^{\mu}$, for all $x \ge 0$.

Proof: Note that, for all $x$,

$$P(x)\,x^{\underline{k}} \le \sum_{y=0}^{\infty} P(y)\,y^{\underline{k}} \le \mu^k,$$

so that, in particular, $P(x)\,x^{\underline{x}} \le \mu^x$, and, $P(x) \le \mu^x/x! = \mathrm{Po}(\mu,x)\,e^{\mu}$. ■

According to [20, Proof of Theorem 2.5], $H(T_{1/n}(P^{*n})) \le H(\mathrm{Po}(\lambda))$ if $P$ is ultra log-concave, so
for such distributions the theorem states that the entropy converges to its maximum. For ultra log-concave
distributions the thermodynamic version also implies convergence in information divergence. This also
holds for Poisson bounded distributions, which is easily proved using dominated convergence. As shown in
the next theorem, convergence in information divergence can be established under quite general conditions.

Theorem 14 (strong version): For any distribution $P$ on $\mathbb{N}_0$ with mean $\lambda$ and $D(P\,\|\,\mathrm{Po}(\lambda)) < \infty$,

$$D(T_{1/n}(P^{*n}) \,\|\, \mathrm{Po}(\lambda)) \to 0, \quad \text{as } n \to \infty.$$

The proof of Theorem 14 is given in the Appendix; it is based on a straightforward but somewhat
technical application of the following general bound.

Proposition 15: Let $X$ be a random variable with distribution $P$ on $\mathbb{N}_0$ and with finite mean $\lambda/\alpha$, for
some $\alpha \in (0,1)$. If $D(P\,\|\,\mathrm{Po}(\lambda/\alpha)) < \infty$, then,

$$D(T_\alpha(P) \,\|\, \mathrm{Po}(\lambda)) \le \frac{\alpha^2}{2(1-\alpha)} + E\Big[\alpha X\log\Big(\frac{\alpha X}{\lambda}\Big)\Big] < \infty. \tag{11}$$



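Proof: First note that, since $P$ has finite mean, its entropy is bounded by the entropy of a geometric
with the same mean, which is finite, so $H(P)$ is finite. Therefore, the divergence $D(P\,\|\,\mathrm{Po}(\lambda/\alpha))$ can be
expanded as,

$$\begin{aligned}
D(P\,\|\,\mathrm{Po}(\lambda/\alpha)) &= E\Big[\log\Big(\frac{P(X)}{\mathrm{Po}(\lambda/\alpha,X)}\Big)\Big] \\
&= E[\log(X!)] + \frac{\lambda}{\alpha} - H(P) - \frac{\lambda}{\alpha}\log\Big(\frac{\lambda}{\alpha}\Big) \\
&\ge \frac{1}{2}E[\log^+(2\pi X)] + E[X\log X] - H(P) - \frac{\lambda}{\alpha}\log\Big(\frac{\lambda}{\alpha}\Big),
\end{aligned} \tag{12}$$

where the last inequality follows from the Stirling bound,

$$\log(x!) \ge \frac{1}{2}\log^+(2\pi x) + x\log x - x,$$

and $\log^+(x)$ denotes the function $\log\max\{x,1\}$. Since $D(P\,\|\,\mathrm{Po}(\lambda/\alpha)) < \infty$, (12) implies that $E[X\log X]$
is finite. [Recall the convention that $0\log 0 = 0$.]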

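Also note that the representation of $T_\alpha(P)$ in (2) can be written as,

$$T_\alpha P(z) = \sum_{x=0}^{\infty} P(x)\Pr\{\mathrm{Bin}(x,\alpha) = z\}.$$

Using this and the joint convexity of information divergence in its two arguments (see, e.g., [6, Theorem 2.7.2]),
the divergence of interest can be bounded as,

$$\begin{aligned}
D(T_\alpha(P) \,\|\, \mathrm{Po}(\lambda)) &= D\Big(\sum_{x=0}^{\infty} P(x)\,\mathrm{Bin}(x,\alpha) \,\Big\|\, \sum_{x=0}^{\infty} P(x)\,\mathrm{Po}(\lambda)\Big) \\
&\le \sum_{x=0}^{\infty} P(x)\,D(\mathrm{Bin}(x,\alpha) \,\|\, \mathrm{Po}(\lambda)),
\end{aligned} \tag{13}$$

where the divergence in the first term (corresponding to $x = 0$) equals $\lambda$. Since the Poisson measures form an exponential
family, they satisfy a Pythagorean identity [8] which, together with the bound,

$$D(\mathrm{Bin}(x,p) \,\|\, \mathrm{Po}(xp)) \le \frac{p^2}{2(1-p)}, \tag{14}$$

see, e.g., [18] or [24], gives, for each $x \ge 1$,

$$\begin{aligned}
D(\mathrm{Bin}(x,\alpha) \,\|\, \mathrm{Po}(\lambda)) &= D(\mathrm{Bin}(x,\alpha) \,\|\, \mathrm{Po}(\alpha x)) + D(\mathrm{Po}(\alpha x) \,\|\, \mathrm{Po}(\lambda)) \\
&\le \frac{\alpha^2}{2(1-\alpha)} + \sum_{j=0}^{\infty} \mathrm{Po}(\alpha x,j)\log\Big(\frac{(\alpha x)^j\exp(-\alpha x)/j!}{\lambda^j\exp(-\lambda)/j!}\Big) \\
&= \frac{\alpha^2}{2(1-\alpha)} + \Big(\alpha x\log\Big(\frac{\alpha x}{\lambda}\Big) - \alpha x + \lambda\Big).
\end{aligned}$$

Since the final bound clearly remains valid for $x = 0$, substituting it into (13) gives (11).

IV. Laws of Thin Numbers: The Non-i.i.d. Case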

In this section we state and prove more general versions of the law of thin numbers, for sequences 
of random variables that are not necessarily independent or identically distributed. Although some of the 
results in this section are strict generalizations of Theorems 11 and 14, their proofs are different.

We begin by showing that, using a general proof technique introduced in [24], the weak law of thin
numbers can be established under weaker conditions than those in Theorem 11. The main idea is to use
the data-processing inequality on the total variation distance between an appropriate pair of distributions.

Theorem 16 (weak version, non-i.i.d.): Let $P_1, P_2, \ldots$ be an arbitrary sequence of distributions on $\mathbb{N}_0$,
and write $P^{(n)} = P_1 * P_2 * \cdots * P_n$ for the convolution of the first $n$ of them. Then,

$$\|T_{1/n}(P^{(n)}) - \mathrm{Po}(\lambda)\| \to 0, \quad n \to \infty,$$

as long as the following three conditions are satisfied as $n \to \infty$:

(a) $a_n = \max_{1\le i\le n}\,[1 - T_{1/n}P_i(0)] \to 0$;

(b) $b_n = \sum_{i=1}^{n} [1 - T_{1/n}P_i(0)] \to \lambda$;

(c) $c_n = \sum_{i=1}^{n} [1 - T_{1/n}P_i(0) - T_{1/n}P_i(1)] \to 0$.

Note that Theorem 16 can be viewed as a one-dimensional version of Grigelionis' Theorem 1 in [14];
recall the relevant comments in the Introduction. Recently, Schuhmacher [30] established nonasymptotic,
quantitative versions of this result, in terms of the Barbour-Brown distance, which metrizes weak convergence
in the space of probability measures of point processes. As the information divergence is a finer
functional than the Barbour-Brown distance, Schuhmacher's results are not directly comparable with the
finite-$n$ bounds we obtain in Propositions 15, 19 and Corollary 32.

Before giving the proof of the theorem, we state a simple lemma on a well-known bound for
$\|\mathrm{Po}(\lambda) - \mathrm{Po}(\mu)\|$. Its short proof is included for completeness.

Lemma 17: For any $\lambda, \mu > 0$,

$$\|\mathrm{Po}(\lambda) - \mathrm{Po}(\mu)\| \le 2\big[1 - e^{-|\lambda-\mu|}\big] \le 2|\lambda - \mu|.$$

Proof: Suppose, without loss of generality, that $\lambda > \mu$, and define two independent random variables
$X \sim \mathrm{Po}(\mu)$ and $Z \sim \mathrm{Po}(\lambda - \mu)$, so that, $Y = X + Z \sim \mathrm{Po}(\lambda)$. Then, by the coupling inequality [26],

$$\|\mathrm{Po}(\lambda) - \mathrm{Po}(\mu)\| \le 2\Pr\{X \ne Y\} = 2\Pr\{Z \ne 0\} = 2\big[1 - e^{-(\lambda-\mu)}\big].$$

The second inequality in the lemma is trivial. ■
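The bound of Lemma 17 is simple to test numerically. The sketch below evaluates the total variation distance in the convention of (9), i.e., half the $\ell_1$ distance, on a truncated support; the truncation level and the parameter pairs are arbitrary choices made for this illustration.

```python
from math import exp

def poisson_pmf(lam, size):
    """First `size` values of the Po(lam) mass function, computed recursively."""
    p, out = exp(-lam), []
    for z in range(size):
        out.append(p)
        p *= lam / (z + 1)
    return out

def tv_poisson(lam, mu, size=200):
    """Half the l1 distance between Po(lam) and Po(mu), truncated at `size` terms."""
    p, q = poisson_pmf(lam, size), poisson_pmf(mu, size)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

checks = []
for lam, mu in [(1.0, 1.1), (2.0, 0.5), (5.0, 5.01), (0.3, 3.0)]:
    d = tv_poisson(lam, mu)
    bound = 2 * (1 - exp(-abs(lam - mu)))
    checks.append(d <= bound <= 2 * abs(lam - mu))
```

Each entry of `checks` is `True` for these parameter pairs, as the lemma requires.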



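Proof of Theorem 16: First we introduce some convenient notation. Let $X_1, X_2, \ldots$ be independent
random variables with $X_i \sim P_i$ for all $i$; for each $n \ge 1$, let $Y_1^{(n)}, Y_2^{(n)}, \ldots$ be independent random variables
with $Y_i^{(n)} \sim T_{1/n}P_i$ for all $i$; and similarly let $Z_1^{(n)}, Z_2^{(n)}, \ldots$ be independent $\mathrm{Po}(\lambda_i^{(n)})$ random variables,
where $\lambda_i^{(n)} = T_{1/n}P_i(1)$, for $i, n \ge 1$. Also we define the sums, $S_n = \sum_{i=1}^{n} Y_i^{(n)}$ and $T_n = \sum_{i=1}^{n} Z_i^{(n)}$,
and note that, $S_n \sim T_{1/n}(P^{(n)})$ and $T_n \sim \mathrm{Po}(\lambda^{(n)})$, where $\lambda^{(n)} = \sum_{i=1}^{n} \lambda_i^{(n)}$, for all $n \ge 1$.

Note that $\lambda^{(n)} \to \lambda$ as $n \to \infty$, since,

$$\lambda^{(n)} = \sum_{i=1}^{n} T_{1/n}P_i(1) = b_n - c_n,$$

and, by assumption, $b_n \to \lambda$ and $c_n \to 0$, as $n \to \infty$.

With these definitions in place, we approximate,

$$\|T_{1/n}(P^{(n)}) - \mathrm{Po}(\lambda)\| \le \|T_{1/n}(P^{(n)}) - \mathrm{Po}(\lambda^{(n)})\| + \|\mathrm{Po}(\lambda^{(n)}) - \mathrm{Po}(\lambda)\|, \tag{15}$$

where, by Lemma 17, the second term is bounded by $2|\lambda^{(n)} - \lambda|$ which vanishes as $n \to \infty$. Therefore,
it suffices to show that the first term in (15) goes to zero. For that term,

$$\begin{aligned}
\|T_{1/n}(P^{(n)}) - \mathrm{Po}(\lambda^{(n)})\| &= \|P_{S_n} - P_{T_n}\| \\
&\le \big\|P_{\{Y_i^{(n)}\}} - P_{\{Z_i^{(n)}\}}\big\| \\
&\le \sum_{i=1}^{n} \|T_{1/n}P_i - \mathrm{Po}(\lambda_i^{(n)})\| \\
&\le \sum_{i=1}^{n} \Big[\|T_{1/n}P_i - \mathrm{Bern}(\lambda_i^{(n)})\| + \|\mathrm{Bern}(\lambda_i^{(n)}) - \mathrm{Po}(\lambda_i^{(n)})\|\Big],
\end{aligned}$$

where $P_{\{Y_i^{(n)}\}}$ and $P_{\{Z_i^{(n)}\}}$ denote the joint distributions of $(Y_1^{(n)}, \ldots, Y_n^{(n)})$ and $(Z_1^{(n)}, \ldots, Z_n^{(n)})$,
and where the first inequality above follows from the fact that, being an $f$-divergence, the total variation
distance satisfies the data-processing inequality [8]; the second inequality comes from the well-known
bound on the total variation distance between two product measures as the sum of the distances between
their respective marginals; and the third bound is simply the triangle inequality.

Finally, noting that, for any random variable $X \sim P$, $\|P - \mathrm{Bern}(P(1))\| = \Pr\{X \ge 2\}$, and also
recalling the simple estimate,

$$\|\mathrm{Bern}(p) - \mathrm{Po}(p)\| = p(1 - e^{-p}) \le p^2,$$

yields,

$$\|T_{1/n}(P^{(n)}) - \mathrm{Po}(\lambda^{(n)})\| \le c_n + \sum_{i=1}^{n} (\lambda_i^{(n)})^2 \le c_n + \lambda^{(n)}\max_{1\le i\le n}\lambda_i^{(n)} \le c_n + \lambda^{(n)}a_n,$$

and, by assumption, this converges to zero as $n \to \infty$, completing the proof. ■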

Recall that, in the i.i.d. case, the weak law of thin numbers only required the first moment of $P$ to be
finite, while the strong version also required that the divergence from $P$ to the Poisson distribution be
finite. For a sum of independent, non-identically distributed random variables with finite second moments,
Proposition 15 can be used as in the proof of Theorem 14 to prove the following result. Note that the
precise conditions required are somewhat analogous to those in Theorem 16.

Theorem 18 (strong version, non-i.i.d.): Let $P_1, P_2, \ldots$ be an arbitrary sequence of distributions on $\mathbb{N}_0$,
where each $P_i$ has finite mean $\lambda_i$ and finite variance. Writing $P^{(n)}$ for the convolution $P_1 * P_2 * \cdots * P_n$,
we have,

$$D(T_{1/n}(P^{(n)}) \,\|\, \mathrm{Po}(\lambda)) \to 0, \quad n \to \infty,$$

as long as the following two conditions are satisfied:

(a) $\lambda^{(n)} = \frac{1}{n}\sum_{i=1}^{n} \lambda_i \to \lambda$, as $n \to \infty$;

(b) $\sum_{i=1}^{\infty} \frac{1}{i^2}E(X_i^2) < \infty$.

The proof of Theorem 18 is given in the Appendix, and it is based on Proposition 15. It turns out that
under the additional condition of finite second moments, the proof of Proposition 15 can be refined to
produce a stronger upper bound on the divergence.

Proposition 19: If $P$ is a distribution on $\mathbb{N}_0$ with mean $\lambda/\alpha$ and variance $\sigma^2 < \infty$, for some $\alpha \in (0,1)$,
then,

$$D(T_\alpha(P) \,\|\, \mathrm{Po}(\lambda)) \le \alpha^2\Big(\frac{1}{2(1-\alpha)} + \frac{\sigma^2}{\lambda}\Big). \tag{16}$$

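Proof: Recall that in the proof of Proposition 15 it was shown that,

$$D(T_\alpha(P) \,\|\, \mathrm{Po}(\lambda)) \le \sum_{x=0}^{\infty} P(x)\,D(\mathrm{Bin}(x,\alpha) \,\|\, \mathrm{Po}(\lambda)), \tag{17}$$

where,

$$\begin{aligned}
D(\mathrm{Bin}(x,\alpha) \,\|\, \mathrm{Po}(\lambda)) &\le \frac{\alpha^2}{2(1-\alpha)} + \lambda\Big(\frac{\alpha x}{\lambda}\log\Big(\frac{\alpha x}{\lambda}\Big) - \frac{\alpha x}{\lambda} + 1\Big) \\
&\le \frac{\alpha^2}{2(1-\alpha)} + \lambda\Big(\frac{\alpha x}{\lambda} - 1\Big)^2,
\end{aligned} \tag{18}$$

and where in the last step above we used the simple bound $y\log y - y + 1 \le y(y-1) - y + 1 = (y-1)^2$,
for $y > 0$. Substituting (18) into (17) yields,

$$\begin{aligned}
D(T_\alpha(P) \,\|\, \mathrm{Po}(\lambda)) &\le \sum_{x=0}^{\infty} P(x)\Big(\frac{\alpha^2}{2(1-\alpha)} + \lambda\Big(\frac{\alpha x}{\lambda} - 1\Big)^2\Big) \\
&= \frac{\alpha^2}{2(1-\alpha)} + \frac{\alpha^2}{\lambda}\sum_{x=0}^{\infty} P(x)\Big(x - \frac{\lambda}{\alpha}\Big)^2 \\
&= \alpha^2\Big(\frac{1}{2(1-\alpha)} + \frac{\sigma^2}{\lambda}\Big),
\end{aligned}$$

as claimed.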
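To see how the bound (16) behaves on a concrete case, the following sketch computes the divergence of a thinned, finitely supported distribution from the Poisson law and compares it with the right-hand side of (16). The distribution `P`, the value of `alpha` and the helper names are hypothetical choices for this illustration.

```python
from math import comb, exp, factorial, log

def thin(pmf, alpha):
    """pmf of T_alpha(P) for a finitely supported P, via formula (2)."""
    n = len(pmf)
    return [sum(pmf[x] * comb(x, z) * alpha**z * (1 - alpha)**(x - z)
                for x in range(z, n)) for z in range(n)]

def divergence_to_poisson(pmf, lam):
    """D(pmf || Po(lam)) in nats, for a finitely supported pmf (0 log 0 = 0)."""
    return sum(p * log(p / (exp(-lam) * lam**z / factorial(z)))
               for z, p in enumerate(pmf) if p > 0)

P = [0.6, 0.0, 0.4]                                   # mean 0.8, plays the role of lambda / alpha
alpha = 0.25
mean = sum(x * p for x, p in enumerate(P))
var = sum((x - mean) ** 2 * p for x, p in enumerate(P))
lam = alpha * mean                                    # mean of the thinned distribution

d = divergence_to_poisson(thin(P, alpha), lam)
bound = alpha**2 * (1 / (2 * (1 - alpha)) + var / lam)  # right-hand side of (16)
```

In this example the divergence `d` is far below `bound`, which is consistent with (16) being a general-purpose estimate rather than a sharp one.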



Using the bound (16) instead of Proposition 15, the following more general version of the law of thin
numbers can be established:

Theorem 20 (strong version, non-i.i.d.): Let $\{X_i\}$ be a sequence of (not necessarily independent or
identically distributed) random variables on $\mathbb{N}_0$, and write $P^{(n)}$ for the distribution of the partial sum
$S_n = X_1 + X_2 + \cdots + X_n$, $n \ge 1$. Assume that the $\{X_i\}$ have finite means and variances, and that:

(a) They are "uniformly ultra bounded," in that, $\mathrm{Var}(X_i) \le C\,E(X_i)$ for all $i$, with a common $C < \infty$;

(b) Their means satisfy $E(S_n) \to \infty$ as $n \to \infty$;

(c) Their covariances satisfy,

$$\frac{\sum_{1\le i<j\le n}\mathrm{Cov}(X_i,X_j)}{(E(S_n))^2} \to 0, \quad \text{as } n \to \infty.$$

If in fact $E(X_i) = \lambda > 0$ for all $i$, then,

$$\lim_{n\to\infty} D(T_{1/n}(P^{(n)}) \,\|\, \mathrm{Po}(\lambda)) = 0.$$

More generally,

$$\lim_{n\to\infty} D(T_{\alpha_n}(P^{(n)}) \,\|\, \mathrm{Po}(\lambda)) = 0, \quad \text{where } \alpha_n = \lambda/E(S_n).$$

Proof: Obviously it suffices to prove the general statement. Proposition 19 applied to $P^{(n)}$ gives,

$$\begin{aligned}
D(T_{\alpha_n}(P^{(n)}) \,\|\, \mathrm{Po}(\lambda)) &\le \alpha_n^2\Big(\frac{1}{2(1-\alpha_n)} + \frac{\mathrm{Var}(S_n)}{\lambda}\Big) \\
&= \frac{\alpha_n^2}{2(1-\alpha_n)} + \frac{\lambda\,\mathrm{Var}(S_n)}{(E(S_n))^2} \\
&= \frac{\alpha_n^2}{2(1-\alpha_n)} + \frac{\lambda}{(E(S_n))^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) + \frac{2\lambda}{(E(S_n))^2}\sum_{1\le i<j\le n}\mathrm{Cov}(X_i,X_j).
\end{aligned}$$

The first and third terms tend to zero by assumptions (b) and (c), respectively. And using assumption (a),
the second term is bounded above by,

$$\frac{\lambda}{(E(S_n))^2}\,C\,E(S_n) = \frac{\lambda C}{E(S_n)},$$

which also tends to zero by assumption (b). ■






V. The Thinning Markov Chain 

Before examining the rate of convergence in the law of thin numbers, we consider a related and 
somewhat simpler problem for a Markov chain. Several of the results in this section may be of independent 
interest. The Markov chain we will discuss was first studied in [20, Proof of Theorem 2.5], and, within
this context, it is a natural discrete analog of the Ornstein-Uhlenbeck process associated with the Gaussian
distribution.

Definition 21: Let $P$ be a distribution on $\mathbb{N}_0$. For any $\alpha \in [0,1]$ and $\lambda > 0$, we write $U_\alpha^\lambda(P)$ for the
distribution,

$$U_\alpha^\lambda(P) = T_\alpha(P) * \mathrm{Po}((1-\alpha)\lambda).$$

For simplicity, $U_\alpha^\lambda(P)$ is often written simply as $U_\alpha^\lambda P$.

We note that $U_\alpha^\lambda U_\beta^\lambda = U_{\alpha\beta}^\lambda$, and that, obviously, $U_\alpha^\lambda$ maps probability distributions to probability
distributions. Therefore, if for a fixed $\lambda$ we define $Q^t = U_{e^{-t}}^\lambda$ for all $t \ge 0$, the collection $\{Q^t \,;\, t \ge 0\}$
of linear operators on the space of probability measures on $\mathbb{N}_0$ defines a Markov transition semigroup.
Specifically, for $i, j \in \mathbb{N}_0$, the transition probabilities,

$$Q^t_{ij} = (Q^t(\delta_i))(j) = (U_{e^{-t}}^\lambda(\delta_i))(j) = \big(T_{e^{-t}}(\delta_i) * \mathrm{Po}((1-e^{-t})\lambda)\big)(j) = \Pr\{\mathrm{Bin}(i,e^{-t}) + \mathrm{Po}((1-e^{-t})\lambda) = j\},$$

define a continuous-time Markov chain $\{Z_t \,;\, t \ge 0\}$ on $\mathbb{N}_0$. It is intuitively clear that, as $\alpha \downarrow 0$ (or,
equivalently, $t \to \infty$), the distribution $U_\alpha^\lambda P$ should converge to the $\mathrm{Po}(\lambda)$ distribution. Indeed, the following
two results state that $\{Z_t\}$ is ergodic, with unique invariant measure $\mathrm{Po}(\lambda)$. Theorem 28 gives the rate at
which it converges to $\mathrm{Po}(\lambda)$.
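The semigroup property, the invariance of the Poisson law, and the convergence as $\alpha \downarrow 0$ can all be observed numerically. The sketch below works on a truncated support $\{0, \ldots, N-1\}$, which is an approximation introduced only for this illustration; the names `U`, `tv` and the parameter values are likewise hypothetical.

```python
from math import comb, exp

N = 60  # truncation level for the support (an assumption of this sketch)

def thin(pmf, alpha):
    """pmf of T_alpha(P) on {0, ..., N-1}, via formula (2)."""
    n = len(pmf)
    return [sum(pmf[x] * comb(x, z) * alpha**z * (1 - alpha)**(x - z)
                for x in range(z, n)) for z in range(n)]

def poisson_pmf(lam, size=N):
    p, out = exp(-lam), []
    for z in range(size):
        out.append(p)
        p *= lam / (z + 1)
    return out

def convolve_trunc(p, q):
    """Convolution of two pmfs, truncated to the first N values."""
    out = [0.0] * N
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j < N:
                out[i + j] += a * b
    return out

def U(pmf, alpha, lam):
    """Truncated version of U_alpha^lambda(P) = T_alpha(P) * Po((1 - alpha) lam)."""
    return convolve_trunc(thin(pmf, alpha), poisson_pmf((1 - alpha) * lam))

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

lam = 2.0
po = poisson_pmf(lam)
delta5 = [0.0] * N
delta5[5] = 1.0  # point mass at 5

invariance_gap = tv(U(po, 0.3, lam), po)                       # Po(lam) is invariant
semigroup_gap = tv(U(U(delta5, 0.5, lam), 0.4, lam),           # U_0.4 U_0.5 = U_0.2
                   U(delta5, 0.2, lam))
distances = [tv(U(delta5, a, lam), po) for a in (0.5, 0.1, 0.01)]
```

Up to truncation and rounding error, both gaps vanish, and the distances to $\mathrm{Po}(\lambda)$ decrease as $\alpha$ decreases.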

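Proposition 22: For any distribution $P$ on $\mathbb{N}_0$, $U_\alpha^\lambda(P)$ converges in total variation to $\mathrm{Po}(\lambda)$, as $\alpha \downarrow 0$.

Proof: From the definition of $U_\alpha^\lambda(P)$,

$$\begin{aligned}
\|U_\alpha^\lambda(P) - \mathrm{Po}(\lambda)\| &= \|T_\alpha(P) * \mathrm{Po}((1-\alpha)\lambda) - \mathrm{Po}(\lambda)\| \\
&= \|(T_\alpha(P) - \mathrm{Po}(\alpha\lambda)) * \mathrm{Po}((1-\alpha)\lambda)\| \\
&\le \|T_\alpha(P) - \mathrm{Po}(\alpha\lambda)\| && (19) \\
&= \frac{1}{2}\big|(1 - T_\alpha(P)(0)) - (1 - \mathrm{Po}(\alpha\lambda,0))\big| + \frac{1}{2}\sum_{x=1}^{\infty}\big|T_\alpha(P)(x) - \mathrm{Po}(\alpha\lambda,x)\big| \\
&\le \frac{1}{2}\big[(1 - T_\alpha(P)(0)) + (1 - \mathrm{Po}(\alpha\lambda,0))\big] + \frac{1}{2}\sum_{x=1}^{\infty}\big(T_\alpha(P)(x) + \mathrm{Po}(\alpha\lambda,x)\big) && (20) \\
&= 2 - T_\alpha(P)(0) - \mathrm{Po}(\alpha\lambda,0), && (21)
\end{aligned}$$

where (19) follows from the fact that convolution with any distribution is a contraction with respect to
the norm, (20) follows from the triangle inequality, and (21) converges to zero because of the bound
(10). ■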



Using this, we can give a characterization of the Poisson distribution.

Corollary 23: Let $P$ denote a discrete distribution with mean $\lambda$. If $P = U_\alpha^\lambda(P)$ for some $\alpha \in (0,1)$,
then $P = \mathrm{Po}(\lambda)$. That is, $\mathrm{Po}(\lambda)$ is the unique invariant measure of the Markov chain $\{Z_t\}$, and, moreover,

$$D(U_\alpha^\lambda(P) \,\|\, \mathrm{Po}(\lambda)) \to 0, \quad \text{as } \alpha \downarrow 0,$$

if and only if $D(U_\alpha^\lambda(P) \,\|\, \mathrm{Po}(\lambda)) < \infty$ for some $\alpha > 0$.

Proof: Assume that $P = U_\alpha^\lambda(P)$. Then for any $n$, $P = U_{\alpha^n}^\lambda(P)$, so for any $\epsilon > 0$, by Proposition 22,
$\|P - \mathrm{Po}(\lambda)\| = \|U_{\alpha^n}^\lambda(P) - \mathrm{Po}(\lambda)\| \le \epsilon$ for $n$ sufficiently large. The strengthened convergence of
$D(U_\alpha^\lambda(P) \,\|\, \mathrm{Po}(\lambda))$ to zero if $D(U_\alpha^\lambda(P) \,\|\, \mathrm{Po}(\lambda)) < \infty$ can be proved using standard arguments along
the lines of the corresponding discrete-time results in [12], [3], [16]. ■

Next we shall study the rate of convergence of $U_\alpha^\lambda(P)$ to the Poisson distribution. It is easy to check
that the Markov chain $\{Z_t\}$ is in fact reversible with respect to its invariant measure $\mathrm{Po}(\lambda)$. Therefore,
the natural setting for the study of its convergence is the space $L^2$ of functions $f : \mathbb{N}_0 \to \mathbb{R}$ such that,
$E[f(Z)^2] < \infty$ for $Z \sim \mathrm{Po}(\lambda)$. This space is also endowed with the usual inner product,

$$\langle f, g\rangle = E[f(Z)g(Z)], \quad \text{for } Z \sim \mathrm{Po}(\lambda),\ f, g \in L^2,$$

and the linear operators $U_\alpha^\lambda$ act on functions $f \in L^2$ by mapping each $f$ into,

$$(U_\alpha^\lambda f)(x) = E[f(Z_{\alpha,\lambda,x})] \quad \text{for } Z_{\alpha,\lambda,x} \sim U_\alpha^\lambda(\delta_x).$$

In other words,

$$(U_\alpha^\lambda f)(x) = E[f(Z_{\log(1/\alpha)}) \,|\, Z_0 = x], \quad x \in \mathbb{N}_0.$$

The reversibility of $\{Z_t\}$ with respect to $\mathrm{Po}(\lambda)$ implies that $U_\alpha^\lambda$ is a self-adjoint linear operator on $L^2$,
therefore, its eigenvectors are orthogonal functions. In this context, we introduce the Poisson-Charlier
family of orthogonal polynomials $P_k^\lambda$:

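Definition 24: For given $\lambda$, the Poisson-Charlier polynomial of order $k$ is given by,

$$P_k^\lambda(x) = \frac{1}{(\lambda^k k!)^{1/2}} \sum_{\ell=0}^{k} (-\lambda)^{k-\ell} \binom{k}{\ell} x^{\underline{\ell}},$$

where $x^{\underline{\ell}} = x(x-1)\cdots(x-\ell+1)$ denotes the falling factorial.

Some well-known properties of the Poisson-Charlier polynomials are listed in the following lemma without proof. Note that their exact form depends on the chosen normalization; other authors present similar results, but with different normalizations.

Lemma 25: For any $\lambda$, $\mu$, $k$ and $\ell$,

1) $\langle P_k^\lambda, P_\ell^\lambda \rangle = \delta_{k\ell}$ (22)

2) $P_{k+1}^\lambda(x) = \dfrac{x P_k^\lambda(x-1) - \lambda P_k^\lambda(x)}{(\lambda(k+1))^{1/2}}$ (23)

3) $P_k^\lambda(x+1) - P_k^\lambda(x) = \left(\dfrac{k}{\lambda}\right)^{1/2} P_{k-1}^\lambda(x)$ (24)

4) $P_k^{\lambda+\mu}(x+y) = \displaystyle\sum_{\ell=0}^{k} \binom{k}{\ell}^{1/2} \alpha^{\ell/2} (1-\alpha)^{(k-\ell)/2} P_\ell^\lambda(x) P_{k-\ell}^\mu(y)$, (25)

where $\alpha = \lambda/(\lambda+\mu)$. (26)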
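The properties in Lemma 25 can be checked numerically. The sketch below assumes the normalization $P_k^\lambda(x) = (\lambda^k k!)^{-1/2}\sum_{\ell=0}^{k}(-\lambda)^{k-\ell}\binom{k}{\ell}x^{\underline{\ell}}$ (other normalizations appear in the literature) and approximates the inner product by truncating the Poisson sum; the function names and the value $\lambda = 1.5$ are illustrative choices only.

```python
import math


def falling(x, l):
    """Falling factorial x(x-1)...(x-l+1), with falling(x, 0) = 1."""
    out = 1
    for i in range(l):
        out *= x - i
    return out


def charlier(k, lam, x):
    """Poisson-Charlier polynomial P_k^lam(x) under the assumed normalization."""
    total = sum(
        (-lam) ** (k - l) * math.comb(k, l) * falling(x, l) for l in range(k + 1)
    )
    return total / math.sqrt(lam**k * math.factorial(k))


def inner(f, g, lam, n_max=120):
    """<f, g> = E[f(Z) g(Z)] for Z ~ Po(lam), truncated at n_max."""
    total, weight = 0.0, math.exp(-lam)
    for x in range(n_max + 1):
        total += weight * f(x) * g(x)
        weight *= lam / (x + 1)
    return total


lam = 1.5
# Gram matrix of P_0, ..., P_4: should be close to the identity, cf. (22).
gram = [
    [
        inner(lambda x: charlier(j, lam, x), lambda x: charlier(k, lam, x), lam)
        for k in range(5)
    ]
    for j in range(5)
]
for row in gram:
    print([round(v, 10) for v in row])
```

The same helpers can be used to spot-check the recurrence (23) and the difference relation (24) at individual points.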

Observe that, since the Poisson-Charlier polynomials form an orthonormal set, any function $f \in L^2$ can be expanded as,

$$f(x) = \sum_{k=0}^{\infty} \langle f, P_k^\lambda \rangle P_k^\lambda(x). \qquad (27)$$

It will be convenient to be able to translate between factorial moments and the "Poisson-Charlier moments," $E[P_k^\lambda(X)]$. For example, if $X \sim \mathrm{Po}(\lambda)$, then taking $\ell = 0$ in (22) shows that $E[P_k^\lambda(X)] = 0$ for all $k \ge 1$. More generally, the following proposition shows that the role of the Poisson-Charlier moments with respect to the Markov chain $\{Z_t\}$ is analogous to the role played by the factorial moments with respect to the pure thinning operation; cf. Lemma 3. Its proof, given in the Appendix, is similar to that of Lemma 3.

Proposition 26: Let $X \sim P$ be a random variable with mean $\lambda$ and write $X_{\alpha,\lambda}$ for a random variable with distribution $U_\alpha^\lambda(P)$. Then,

$$E[P_k^\lambda(X_{\alpha,\lambda})] = \alpha^k E[P_k^\lambda(X)].$$
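A quick numerical check of Proposition 26 can be sketched as follows. It assumes that $U_\alpha^\lambda(P) = T_\alpha(P) * \mathrm{Po}((1-\alpha)\lambda)$ and the normalization $P_k^\lambda(x) = (\lambda^k k!)^{-1/2}\sum_{\ell}(-\lambda)^{k-\ell}\binom{k}{\ell}x^{\underline{\ell}}$; the test distribution $\mathrm{Bin}(5, 0.4)$, the value $\alpha = 0.3$ and all function names are hypothetical choices for illustration.

```python
import math


def falling(x, l):
    """Falling factorial x(x-1)...(x-l+1)."""
    out = 1
    for i in range(l):
        out *= x - i
    return out


def charlier(k, lam, x):
    """Poisson-Charlier polynomial P_k^lam(x) under the assumed normalization."""
    total = sum(
        (-lam) ** (k - l) * math.comb(k, l) * falling(x, l) for l in range(k + 1)
    )
    return total / math.sqrt(lam**k * math.factorial(k))


def poisson_pmf(lam, n_max):
    pmf = [math.exp(-lam)]
    for x in range(1, n_max + 1):
        pmf.append(pmf[-1] * lam / x)
    return pmf


def thin(pmf, alpha):
    """T_alpha(P) as a mixture of binomials."""
    out = [0.0] * len(pmf)
    for y, p in enumerate(pmf):
        for x in range(y + 1):
            out[x] += p * math.comb(y, x) * alpha**x * (1 - alpha) ** (y - x)
    return out


def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def charlier_moment(pmf, k, lam):
    """E[P_k^lam(X)] for X with the given pmf."""
    return sum(p * charlier(k, lam, x) for x, p in enumerate(pmf))


lam, alpha = 2.0, 0.3
p = [math.comb(5, x) * 0.4**x * 0.6 ** (5 - x) for x in range(6)]  # Bin(5, 0.4)
u = convolve(thin(p, alpha), poisson_pmf((1 - alpha) * lam, 60))   # U_alpha^lam(P)
lhs = [charlier_moment(u, k, lam) for k in range(1, 5)]
rhs = [alpha**k * charlier_moment(p, k, lam) for k in range(1, 5)]
print(lhs)
print(rhs)
```

Both lists agree up to truncation and rounding error, and the first entry is zero because $P$ has mean $\lambda$.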

If we replace $\alpha$ by $\exp(-t)$ and assume that the thinning Markov chain $\{Z_t\}$ has initial distribution $Z_0 \sim P$ with mean $\lambda$, then Proposition 26 states that,

$$E[P_k^\lambda(Z_t)] = e^{-kt} E[P_k^\lambda(Z_0)],$$

that is, the Poisson-Charlier moments of $Z_t$ tend to zero like $\exp(-kt)\,E[P_k^\lambda(Z_0)]$. Similarly, expanding any $f \in L^2$ in terms of Poisson-Charlier polynomials, $f(x) = \sum_{k=0}^{\infty} \langle f, P_k^\lambda\rangle P_k^\lambda(x)$, and using Proposition 26,

$$E[f(Z_t)] = E\Bigl[\sum_{k=0}^{\infty} \langle f, P_k^\lambda\rangle P_k^\lambda(Z_t)\Bigr] = \sum_{k=0}^{\infty} \exp(-kt)\, \langle f, P_k^\lambda\rangle\, E[P_k^\lambda(X)].$$

Thus, the rate of convergence of $\{Z_t\}$ will be dominated by the term corresponding to $E[P_\kappa^\lambda(X)]$, where $\kappa$ is the first $k \ge 1$ such that $E[P_k^\lambda(X)] \ne 0$.

The following proposition (proved in the Appendix) will be used in the proof of Theorem 28 below, which shows that this is indeed the right rate in terms of the $\chi^2$ distance. Note that there is no restriction on the mean of $X \sim P$ in the proposition.

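Proposition 27: If $X \sim P$ is Poisson bounded, then the likelihood ratio $P/\mathrm{Po}(\lambda)$ can be expanded as:

$$\frac{P(x)}{\mathrm{Po}(\lambda,x)} = \sum_{k=0}^{\infty} E[P_k^\lambda(X)]\, P_k^\lambda(x), \quad x \ge 0.$$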
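Assuming $X \sim P \in PB(\lambda)$, combining Propositions 26 and 27 we obtain that,

$$
\begin{aligned}
\frac{U_\alpha^\lambda P(x)}{\mathrm{Po}(\lambda,x)} &= \sum_{k=0}^{\infty} E[P_k^\lambda(X_{\alpha,\lambda})]\, P_k^\lambda(x) \\
&= 1 + \sum_{k=\kappa}^{\infty} \alpha^k E[P_k^\lambda(X)]\, P_k^\lambda(x) \\
&= 1 + \alpha^\kappa \sum_{k=\kappa}^{\infty} \alpha^{k-\kappa} E[P_k^\lambda(X)]\, P_k^\lambda(x), \qquad (28)
\end{aligned}
$$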

where, as before, $\kappa$ denotes the first integer $k \ge 1$ such that $E[P_k^\lambda(X)] \ne 0$. This sum can be viewed as a discrete analog of the well-known Edgeworth expansion for the distribution of a continuous random variable. A technical disadvantage of both this and the standard Edgeworth expansion is that, although the sum converges in $L^2$, truncating it to a finite number of terms in general produces an expression which may take negative values. By a more detailed analysis we shall see in the following two sections how to get around this problem.

For now, we determine the rate of convergence of $U_\alpha^\lambda P$ to $\mathrm{Po}(\lambda)$ in terms of the $\chi^2$ distance between $U_\alpha^\lambda P$ and $\mathrm{Po}(\lambda)$; recall the definition of the $\chi^2$ distance between two probability distributions $P$ and $Q$ on $\mathbb{N}_0$:

$$\chi^2(P,Q) = \sum_{x=0}^{\infty} Q(x)\Bigl(\frac{P(x)}{Q(x)} - 1\Bigr)^2.$$



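Theorem 28: If $X \sim P$ is Poisson bounded, then $\chi^2(U_\alpha^\lambda P, \mathrm{Po}(\lambda))$ is finite for all $\alpha \in [0,1]$ and,

$$\lim_{\alpha \downarrow 0} \frac{\chi^2(U_\alpha^\lambda P, \mathrm{Po}(\lambda))}{\alpha^{2\kappa}} = E[P_\kappa^\lambda(X)]^2,$$

where $\kappa$ denotes the smallest $k > 0$ such that $E[P_k^\lambda(X)] \ne 0$.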

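Proof: The proof is based on a Hilbert space argument using the fact that the Poisson-Charlier polynomials are orthogonal. Suppose $X \sim P \in PB(\mu)$. Using Proposition 27,

$$
\begin{aligned}
\chi^2(U_\alpha^\lambda P, \mathrm{Po}(\lambda)) &= \sum_{x=0}^{\infty} \mathrm{Po}(\lambda,x)\Bigl(\frac{U_\alpha^\lambda P(x)}{\mathrm{Po}(\lambda,x)} - 1\Bigr)^2 \\
&= \sum_{x=0}^{\infty} \mathrm{Po}(\lambda,x)\Bigl(\sum_{k=\kappa}^{\infty} \alpha^k E[P_k^\lambda(X)]\, P_k^\lambda(x)\Bigr)^2 \\
&= \sum_{k=\kappa}^{\infty} \alpha^{2k} E[P_k^\lambda(X)]^2,
\end{aligned}
$$

where the last step follows from the orthogonality relation (22). For $\alpha = 1$ we have,

$$\chi^2(P, \mathrm{Po}(\lambda)) = \sum_{x=0}^{\infty} \mathrm{Po}(\lambda,x)\Bigl(\frac{P(x)}{\mathrm{Po}(\lambda,x)} - 1\Bigr)^2 = \sum_{x=0}^{\infty} P(x)\,\frac{P(x)}{\mathrm{Po}(\lambda,x)} - 1,$$

which is finite. From the previous expansion we see that $\chi^2(U_\alpha^\lambda P, \mathrm{Po}(\lambda))$ is increasing in $\alpha$, which implies the finiteness claim. Moreover, that expansion has $\alpha^{2\kappa} E[P_\kappa^\lambda(X)]^2$ as its dominant term, implying the stated limit. ■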
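The identity $\chi^2(U_\alpha^\lambda P, \mathrm{Po}(\lambda)) = \sum_{k \ge \kappa} \alpha^{2k} E[P_k^\lambda(X)]^2$ used in the proof can be illustrated numerically. The following sketch compares a direct evaluation of the $\chi^2$ distance with the truncated Poisson-Charlier series. It assumes the normalization $P_k^\lambda(x) = (\lambda^k k!)^{-1/2}\sum_{\ell}(-\lambda)^{k-\ell}\binom{k}{\ell}x^{\underline{\ell}}$; the distribution $\mathrm{Bin}(5,0.4)$, the truncation levels and the function names are assumptions of the example.

```python
import math


def falling(x, l):
    out = 1
    for i in range(l):
        out *= x - i
    return out


def charlier(k, lam, x):
    """Poisson-Charlier polynomial P_k^lam(x) under the assumed normalization."""
    total = sum(
        (-lam) ** (k - l) * math.comb(k, l) * falling(x, l) for l in range(k + 1)
    )
    return total / math.sqrt(lam**k * math.factorial(k))


def poisson_pmf(lam, n_max):
    pmf = [math.exp(-lam)]
    for x in range(1, n_max + 1):
        pmf.append(pmf[-1] * lam / x)
    return pmf


def thin(pmf, alpha):
    out = [0.0] * len(pmf)
    for y, p in enumerate(pmf):
        for x in range(y + 1):
            out[x] += p * math.comb(y, x) * alpha**x * (1 - alpha) ** (y - x)
    return out


def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def chi2_direct(pmf, alpha, lam, n_max=60):
    """sum_x Po(lam, x) * (U P(x) / Po(lam, x) - 1)^2 on the support 0..n_max."""
    po = poisson_pmf(lam, n_max)
    u = convolve(thin(pmf, alpha), poisson_pmf((1 - alpha) * lam, n_max))
    return sum(q * (ux / q - 1.0) ** 2 for ux, q in zip(u, po))


def chi2_series(pmf, alpha, lam, k_max=30):
    """sum_{k=1}^{k_max} alpha^(2k) * E[P_k^lam(X)]^2."""
    total = 0.0
    for k in range(1, k_max + 1):
        m = sum(p * charlier(k, lam, x) for x, p in enumerate(pmf))
        total += alpha ** (2 * k) * m * m
    return total


lam = 2.0
p = [math.comb(5, x) * 0.4**x * 0.6 ** (5 - x) for x in range(6)]  # Bin(5, 0.4)
print(chi2_direct(p, 0.5, lam), chi2_series(p, 0.5, lam))
```

For this example $\kappa = 2$, so dividing `chi2_direct(p, a, lam)` by `a**4` for small `a` approaches $E[P_2^\lambda(X)]^2$, in line with the limit in Theorem 28.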

Theorem 28 readily leads to upper bounds on the rate of convergence in terms of information divergence via the standard bound,

$$D(P\|Q) \le \log(1 + \chi^2(P,Q)) \le \chi^2(P,Q),$$

which follows from direct applications of Jensen's inequality. Furthermore, replacing this bound by the well-known approximation [8],

$$D(P\|Q) \approx \tfrac{1}{2}\chi^2(P,Q),$$

gives the estimate,

$$D(U_\alpha^\lambda P\,\|\,\mathrm{Po}(\lambda)) \approx \frac{\alpha^{2\kappa} E[P_\kappa^\lambda(X)]^2}{2}.$$

We shall later prove that, in certain cases, this approximation can indeed be rigorously justified.






VI. The Rate of Convergence in the Strong Law of Thin Numbers 

Let $X \sim P$ be a random variable on $\mathbb{N}_0$ with mean $\lambda$. In Theorem 14 we showed that, if $D(P\|\mathrm{Po}(\lambda))$ is finite, then,

$$D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \to 0, \quad \text{as } n \to \infty. \qquad (29)$$

If $P$ also has finite variance $\sigma^2$, then Proposition 19 implies that, for all $n \ge 2$,



$$D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le \frac{1}{2n^2(1 - 1/n)} + \frac{\sigma^2}{n\lambda}, \qquad (30)$$

suggesting a convergence rate of order $1/n$. In this section, we prove more precise upper bounds on the rate of convergence in the strong law of thin numbers (29). For example, if $X$ is an ultra bounded random variable with $\sigma^2 \ne \lambda$, then we show that in fact,

$$\limsup_{n \to \infty}\, n^2 D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le 2c^2,$$

where $c = E[P_2^\lambda(X)] = (\sigma^2 - \lambda)/(\lambda\sqrt{2}) \ne 0$. This follows from the more general result of Corollary 32; its proof is based on a detailed analysis of the scaled Fisher information introduced in [24]. We begin by briefly reviewing some properties of the scaled Fisher information:

Definition 29: The scaled Fisher information of a random variable $X \sim P$ with mean $\lambda$, is defined by,

$$K(X) = K(P) = \lambda E[\rho_X(X)^2],$$

where $\rho_X$ denotes the scaled score function,

$$\rho_X(x) = \frac{(x+1)P(x+1)}{\lambda P(x)} - 1.$$

In [24, Proposition 2] it was shown, using a logarithmic Sobolev inequality of Bobkov and Ledoux [??], that for any $X \sim P$,

$$D(P\,\|\,\mathrm{Po}(\lambda)) \le K(X), \qquad (31)$$

under mild conditions on the support of $P$. Also, [24, Proposition 3] states that $K(X)$ satisfies a subadditivity property: For independent random variables $X_1, X_2, \ldots, X_n$,

$$K\Bigl(\sum_{i=1}^{n} X_i\Bigr) \le \sum_{i=1}^{n} \frac{\lambda_i}{\lambda} K(X_i), \qquad (32)$$

where $\lambda_i = E[X_i]$ and $\lambda = \sum_{i=1}^{n} \lambda_i$. In particular, recalling that the thinning of a convolution is the convolution of the corresponding thinnings, if $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with mean $\lambda$ then the bounds in (31) and (32) imply,

$$D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le K(T_{1/n}(P)). \qquad (33)$$

Therefore, our next goal is to determine the rate at which $K(T_\alpha(X))$ tends to 0 for $\alpha$ tending to 0. We begin with the following proposition; its proof is given in the Appendix.

Proposition 30: If $X \sim P$ is Poisson bounded, then $P$ admits the representation,

$$P(x) = \frac{1}{x!} \sum_{\ell=0}^{\infty} (-1)^\ell \frac{E[X^{\underline{x+\ell}}]}{\ell!}, \quad x \ge 0.$$

Moreover, the truncated sum from $\ell = 0$ to $m$ is an upper bound for $P(x)$ if $m$ is even, and a lower bound if $m$ is odd.
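The alternating bounds in Proposition 30 are easy to observe numerically. The sketch below assumes the representation is the factorial-moment inversion formula $P(x) = \frac{1}{x!}\sum_{\ell \ge 0} (-1)^\ell E[X^{\underline{x+\ell}}]/\ell!$ and uses $\mathrm{Bin}(5, 0.4)$ as a test case, for which the series terminates; the function names are illustrative.

```python
import math


def falling(x, l):
    """Falling factorial x(x-1)...(x-l+1)."""
    out = 1
    for i in range(l):
        out *= x - i
    return out


def factorial_moment(pmf, k):
    """E[X^(k)] = sum_x P(x) * x(x-1)...(x-k+1)."""
    return sum(p * falling(x, k) for x, p in enumerate(pmf))


def truncated_inversion(pmf, x, m):
    """(1/x!) * sum_{l=0}^{m} (-1)^l * E[X^(x+l)] / l!."""
    total = sum(
        (-1) ** l * factorial_moment(pmf, x + l) / math.factorial(l)
        for l in range(m + 1)
    )
    return total / math.factorial(x)


p = [math.comb(5, x) * 0.4**x * 0.6 ** (5 - x) for x in range(6)]  # Bin(5, 0.4)
for m in range(6):
    # Truncations for x = 0: upper bounds for even m, lower bounds for odd m.
    print(m, truncated_inversion(p, 0, m), p[0])
```

For $x = 0$ and $m = 0, 1$ the truncations reduce to the elementary bounds $1 - E[X] \le P(0) \le 1$.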

An important consequence of this proposition is that $T_\alpha P(x)$ tends to zero like $\alpha^x$, as $\alpha \downarrow 0$. Moreover, it leads to the following asymptotic result for the scaled Fisher information, also proved in the Appendix.

Theorem 31: Suppose $X \sim P$ has mean $\lambda$ and it is ultra bounded with ratio $\lambda$. Let $\kappa$ denote the smallest integer $k \ge 1$ such that $E[P_k^\lambda(X)] \ne 0$. Then,

$$\lim_{\alpha \to 0} \frac{K(T_\alpha P)}{\alpha^\kappa} = \kappa c^2,$$

where $c = E[P_\kappa^\lambda(X)]$.

Combining Theorem 31 with (33) immediately yields:

Corollary 32: Suppose $X \sim P$ has mean $\lambda$ and it is ultra bounded with ratio $\lambda$. Let $\kappa$ denote the smallest integer $k \ge 1$ such that $E[P_k^\lambda(X)] \ne 0$. Then,

$$\limsup_{n \to \infty}\, n^\kappa D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le \kappa c^2,$$

where $c = E[P_\kappa^\lambda(X)]$.

VII. Monotonicity Results for the Scaled Fisher Information

In this section we establish a finer result for the behavior of the scaled Fisher information upon thinning, and use that to deduce a stronger finite-$n$ upper bound for the strong law of thin numbers. Specifically, if $X \sim P$ is ULC with mean $\lambda$, and $X_\alpha$ denotes a random variable with distribution $T_\alpha P$, we will show that $K(X_\alpha) \le \alpha^2 K(X)$. This implies that, for all ULC random variables $X$, we have the following finite-$n$ version of the strong law of thin numbers,

$$D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le \frac{K(X)}{n^2}.$$

Note that, unlike the more general result in (30) which gives a bound of order $1/n$, the above bound is of order $1/n^2$, as long as $X$ is ULC.

The key observation for these results is in the following lemma. 

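Lemma 33: Suppose $X$ is a ULC random variable with distribution $P$ and mean $\lambda$. For any $\alpha \in (0,1)$, write $X_\alpha$ for a random variable with distribution $T_\alpha P$. Then the derivative of $K(X_\alpha)/\alpha$ with respect to $\alpha$ satisfies,

$$\frac{\partial}{\partial\alpha}\Bigl(\frac{K(X_\alpha)}{\alpha}\Bigr) = \frac{S(X_\alpha)}{\alpha^2},$$

where, for a random variable $Y$ with mass function $Q$ and mean $\mu$, we define,

$$S(Y) = \sum_{y=0}^{\infty} \frac{Q(y+1)(y+1)}{\mu}\Bigl(\frac{Q(y+1)(y+1)}{Q(y)} - \frac{Q(y+2)(y+2)}{Q(y+1)}\Bigr)^2.$$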

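Proof: This result follows on using the expression for the derivative of $T_\alpha P$ arising as the case $f(\alpha) = g(\alpha) = 0$ in Proposition 3.6 of [20], that is,

$$\frac{\partial}{\partial\alpha}(T_\alpha P)(x) = \frac{1}{\alpha}\bigl[x (T_\alpha P)(x) - (x+1)(T_\alpha P)(x+1)\bigr].$$

Using this, for each $x$ we deduce that,

$$
\begin{aligned}
\frac{\partial}{\partial\alpha}\Bigl(\frac{((T_\alpha P)(x+1))^2 (x+1)^2}{\alpha^2 (T_\alpha P)(x)\,\lambda}\Bigr)
&= \frac{(T_\alpha P)(x+1)(x+1)}{\alpha^3\lambda}\Bigl(\frac{(T_\alpha P)(x+1)(x+1)}{(T_\alpha P)(x)} - \frac{(T_\alpha P)(x+2)(x+2)}{(T_\alpha P)(x+1)}\Bigr)^2 \\
&\quad + \frac{1}{\alpha^3\lambda}\Bigl(\frac{((T_\alpha P)(x+1))^2 (x+1)^2 x}{(T_\alpha P)(x)} - \frac{((T_\alpha P)(x+2))^2 (x+2)^2 (x+1)}{(T_\alpha P)(x+1)}\Bigr).
\end{aligned}
$$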

The result follows (with the term-by-term differentiation of the infinite sum justified) if the sum of these terms in $x$ is absolutely convergent. The first terms are positive, and their sum is absolutely convergent to $S$ by assumption. The second terms form a collapsing sum, which is absolutely convergent assuming that,

$$\sum_{x=0}^{\infty} \frac{((T_\alpha P)(x+1))^2 (x+1)^2 x}{(T_\alpha P)(x)} < \infty.$$

Note that, for any ULC distribution $Q$, by definition we have for all $x$, $(x+1)Q(x+1)/Q(x) \le xQ(x)/Q(x-1)$, so that the above sum is bounded above by,

$$\frac{(T_\alpha P)(1)}{(T_\alpha P)(0)} \sum_{x=0}^{\infty} (T_\alpha P)(x+1)(x+1)x,$$

which is finite by Proposition 9. ■

We now deduce the following theorem, which parallels Theorem 8 respectively of [35], where a corresponding result is proved for the information divergence.

Theorem 34: Let $X \sim P$ be a ULC random variable with mean $\lambda$. Write $X_\alpha$ for a random variable with distribution $T_\alpha P$. Then:

(i) $K(X_\alpha) \le \alpha^2 K(X)$, $\alpha \in (0,1)$; (34)

(ii) $D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le \dfrac{K(X)}{n^2}$, $n \ge 2$. (35)

Proof: The first part follows from the observation that $K(T_\alpha X)/\alpha^2$ is increasing in $\alpha$, since, by Lemma 33, its derivative is $(S(T_\alpha X) - K(T_\alpha X))/\alpha^3$. Taking $g(y) = P(y+1)(y+1)/P(y)$ in the more technical Lemma 35 below, we deduce that $S(Y) \ge K(Y)$ for any ULC random variable $Y$, and this proves (i). Then (ii) immediately follows from (i) combined with the earlier bound (33), upon recalling that thinning preserves the ULC property [20]. ■
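Both parts of Theorem 34 can be sanity-checked on a small example. The sketch below evaluates the scaled Fisher information $K(P) = \lambda \sum_x P(x)\rho(x)^2$ with $\rho(x) = (x+1)P(x+1)/(\lambda P(x)) - 1$ for $X \sim \mathrm{Bin}(6, 1/2)$, a standard example of a ULC law, and compares $K(X_\alpha)$ with $\alpha^2 K(X)$ and $D(T_{1/n}(P^{*n})\|\mathrm{Po}(\lambda))$ with $K(X)/n^2$. The function names and the test distribution are assumptions of this illustration.

```python
import math


def thin(pmf, alpha):
    """T_alpha(P) as a mixture of binomials."""
    out = [0.0] * len(pmf)
    for y, p in enumerate(pmf):
        for x in range(y + 1):
            out[x] += p * math.comb(y, x) * alpha**x * (1 - alpha) ** (y - x)
    return out


def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def scaled_fisher(pmf):
    """K(P) = lam * sum_x P(x) * rho(x)^2 over the support of P."""
    lam = sum(x * p for x, p in enumerate(pmf))
    total = 0.0
    for x, p in enumerate(pmf):
        if p > 0.0:
            nxt = pmf[x + 1] if x + 1 < len(pmf) else 0.0
            rho = (x + 1) * nxt / (lam * p) - 1.0
            total += p * rho * rho
    return lam * total


def law_of_thin_numbers(pmf, n):
    """T_{1/n}(P^{*n}): n-fold convolution followed by 1/n-thinning."""
    power = [1.0]
    for _ in range(n):
        power = convolve(power, pmf)
    return thin(power, 1.0 / n)


def kl_to_poisson(pmf, lam):
    """D(P || Po(lam)) for a finitely supported pmf."""
    total = 0.0
    for x, p in enumerate(pmf):
        if p > 0.0:
            log_po = -lam + x * math.log(lam) - math.lgamma(x + 1)
            total += p * (math.log(p) - log_po)
    return total


pmf = [math.comb(6, x) * 0.5**6 for x in range(7)]  # Bin(6, 1/2), mean 3
k_x = scaled_fisher(pmf)
# Part (i): K(X_alpha) / (alpha^2 K(X)) should not exceed 1.
ratios = {a: scaled_fisher(thin(pmf, a)) / (a * a * k_x) for a in (0.1, 0.3, 0.5, 0.7, 0.9)}
# Part (ii): divergence versus the bound K(X) / n^2.
bounds = {n: (kl_to_poisson(law_of_thin_numbers(pmf, n), 3.0), k_x / n**2) for n in range(2, 6)}
print(k_x, ratios, bounds)
```

For a binomial $\mathrm{Bin}(n, p)$ the score is linear and $K = p^2/(1-p)$, so here $K(X) = 1/2$ and the ratios in part (i) equal $(1-p)/(1-\alpha p)$.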

Consider the finite difference operator $\Delta$ defined by, $(\Delta g)(x) = g(x+1) - g(x)$, for functions $g : \mathbb{N}_0 \to \mathbb{R}$. We require a result suggested by relevant results in [5], [23]. Its proof is given in the Appendix.

Lemma 35: Let $Y$ be a ULC random variable with distribution $P$ on $\mathbb{N}_0$. Then for any function $g$, defining $\mu = \sum_{y} P(y) g(y)$,

$$\sum_{y=0}^{\infty} P(y)\,(g(y) - \mu)^2 \le \sum_{y=0}^{\infty} P(y+1)(y+1)\,\Delta g(y)^2.$$

VIII. Bounds in Total Variation

In this section, we show that a modified version of the argument used in the proof of Proposition 19
gives an upper bound to the rate of convergence in the weak law of small numbers. If $X \sim P$ has mean $\lambda$ and variance $\sigma^2$, then combining the bound (16) of Proposition 19 with Pinsker's inequality we obtain,

$$\|T_{1/n}(P^{*n}) - \mathrm{Po}(\lambda)\| \le \Bigl(\frac{1}{4n^2(1 - n^{-1})} + \frac{\sigma^2}{2n\lambda}\Bigr)^{1/2}, \qquad (36)$$

which gives an upper bound of order $n^{-1/2}$. From the asymptotic upper bound on information divergence, Corollary 32, we know that one should be able to obtain upper bounds of order $n^{-1}$. Here we derive an upper bound on total variation using the same technique used in the proof of Proposition 19.

Theorem 36: Let $P$ be a distribution on $\mathbb{N}_0$ with finite mean $\lambda$ and variance $\sigma^2$. Then,

\\TyniPn - Po(A)|| < + {l' ^ 

for all n > 2. 



The proof uses the following simple bound, which follows easily from a result of Yannaros [34, Theorem 2.3]; the details are omitted.

Lemma 37: For any $\lambda > 0$, $m \ge 1$ and $t \in (0, 1/2]$, we have,

||Bin(m,t) - Po(A)|| < 12'^^^ + \mt - A| min |l, • 



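Proof: The first inequality in the proof of Proposition 19 remains valid due to the convexity of the total variation norm (since it is an $f$-divergence). The next equality becomes an inequality, and it is justified by the triangle inequality, and we have:

$$
\begin{aligned}
\|T_{1/n}(P^{*n}) - \mathrm{Po}(\lambda)\| &= \frac{1}{2}\sum_{x}\Bigl|\sum_{y} P^{*n}(y)\Pr\{\mathrm{Bin}(y,1/n) = x\} - \mathrm{Po}(\lambda,x)\Bigr| \\
&\le \sum_{y} P^{*n}(y)\,\frac{1}{2}\sum_{x}\bigl|\Pr\{\mathrm{Bin}(y,1/n) = x\} - \mathrm{Po}(\lambda,x)\bigr| \\
&= \sum_{y} P^{*n}(y)\,\|\mathrm{Bin}(y,1/n) - \mathrm{Po}(\lambda)\|.
\end{aligned}
$$

Applying Lemma 37 to each term on the right-hand side of

$$\|T_{1/n}(P^{*n}) - \mathrm{Po}(\lambda)\| \le \sum_{y \ge 0} P^{*n}(y)\,\|\mathrm{Bin}(y,1/n) - \mathrm{Po}(\lambda)\|,$$

the result follows by an application of Hölder's inequality. ■

IX. Compound Thinning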

There is a natural generalization of the thinning operation, via a process which closely parallels the generalization of the Poisson distribution to the compound Poisson. Starting with a random variable $Y \sim P$ with values in $\mathbb{N}_0$, the $\alpha$-thinned version of $Y$ is obtained by writing $Y = 1 + 1 + \cdots + 1$ ($Y$ times), and then keeping each of these 1s with probability $\alpha$, independently of all the others; cf. (1) above.

More generally, we choose and fix a "compounding" distribution $Q$ on $\mathbb{N} = \{1, 2, \ldots\}$. Given $Y \sim P$ on $\mathbb{N}_0$ and $\alpha \in [0,1]$, then the compound $\alpha$-thinned version of $Y$ with respect to $Q$, or, for short, the $(\alpha, Q)$-thinned version of $Y$, is the random variable which results from first thinning $Y$ as above and then replacing each of the 1s that are kept by an independent random sample from $Q$,

$$\sum_{n=1}^{Y} B_n \xi_n, \quad B_i \sim \text{i.i.d. } \mathrm{Bern}(\alpha), \quad \xi_i \sim \text{i.i.d. } Q, \qquad (37)$$

where all the random variables involved are independent. For fixed $\alpha$ and $Q$, we write $T_{\alpha,Q}(P)$ for the distribution of the $(\alpha, Q)$-thinned version of $Y \sim P$. Then $T_{\alpha,Q}(P)$ can be expressed as a mixture of "compound binomials" in the same way as $T_\alpha(P)$ is a mixture of binomials. The compound binomial distribution with parameters $n, \alpha, Q$, denoted $\mathrm{CBin}(n, \alpha, Q)$, is the distribution of the sum of $n$ i.i.d. random variables, each of which is the product of a $\mathrm{Bern}(\alpha)$ random variable and an independent $\xi \sim Q$ random variable. In other words, it is the $(\alpha, Q)$-thinned version of the point mass at $n$, i.e., the distribution of (37) with $Y = n$ w.p.1. Then we can express the probabilities of the $(\alpha, Q)$-thinned version of $P$ as,

$$T_{\alpha,Q}(P)(k) = \sum_{\ell \ge 0} P(\ell)\, \Pr\{\mathrm{CBin}(\ell, \alpha, Q) = k\}.$$
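The mixture representation above translates directly into a short computation. The following sketch evaluates $T_{\alpha,Q}(P)$ for finitely supported $P$ and $Q$ by building the compound binomial laws $\mathrm{CBin}(\ell, \alpha, Q)$ as successive convolutions of the law of a single term $B\xi$. The list encoding of $Q$ (with `q[0] = 0`), the function names and the example distributions are assumptions made for illustration.

```python
import math


def thin(pmf, alpha):
    """Ordinary thinning T_alpha(P), used here only for comparison."""
    out = [0.0] * len(pmf)
    for y, p in enumerate(pmf):
        for x in range(y + 1):
            out[x] += p * math.comb(y, x) * alpha**x * (1 - alpha) ** (y - x)
    return out


def convolve_trunc(p, q, k_max):
    """Convolution of two pmfs, truncated to the support 0..k_max."""
    out = [0.0] * (k_max + 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j <= k_max:
                out[i + j] += pi * qj
    return out


def compound_thin(pmf, alpha, q, k_max):
    """T_{alpha,Q}(P)(k) for k = 0..k_max, where q[j] = Q(j) and q[0] = 0."""
    step = [1.0 - alpha] + [alpha * qj for qj in q[1:]]  # law of one term B * xi
    out = [0.0] * (k_max + 1)
    cbin = [1.0] + [0.0] * k_max  # CBin(0, alpha, Q) is the point mass at 0
    for p_ell in pmf:
        for k in range(k_max + 1):
            out[k] += p_ell * cbin[k]
        cbin = convolve_trunc(cbin, step, k_max)  # CBin(ell + 1, alpha, Q)
    return out


p = [math.comb(5, x) * 0.4**x * 0.6 ** (5 - x) for x in range(6)]  # Bin(5, 0.4)
q = [0.0, 0.5, 0.3, 0.2]  # Q on {1, 2, 3}
result = compound_thin(p, 0.3, q, k_max=15)
print(result)
```

With $Q$ equal to the point mass at 1 the function reduces to ordinary thinning, and in general the mean of the output is $\alpha\,E[Y]\,E[\xi]$.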

The following two observations are immediate from the definitions.

1) Compound thinning maps a Bernoulli sum into a compound Bernoulli sum: If $P$ is the distribution of the Bernoulli sum $\sum_{i=1}^{n} B_i$ where the $B_i$ are independent $\mathrm{Bern}(p_i)$, then $T_{\alpha,Q}(P)$ is the distribution of the "compound Bernoulli sum," $\sum_{i=1}^{n} B_i' \xi_i$ where the $B_i'$ are independent $\mathrm{Bern}(\alpha p_i)$, and the $\xi_i$ are i.i.d. with distribution $Q$, independent of the $B_i'$.

2) Compound thinning maps the Poisson to the compound Poisson distribution, that is, $T_{\alpha,Q}(\mathrm{Po}(\lambda)) = \mathrm{CPo}(\alpha\lambda, Q)$, the compound Poisson distribution with rate $\alpha\lambda$ and compounding distribution $Q$. Recall that $\mathrm{CPo}(\lambda, Q)$ is defined as the distribution of,

$$\sum_{i=1}^{\Pi_\lambda} \xi_i,$$

where the $\xi_i$ are as before, and $\Pi_\lambda$ is a $\mathrm{Po}(\lambda)$ random variable that is independent of the $\xi_i$.

Perhaps the most natural way in which the compound Poisson distribution arises is as the limit of compound binomials. That is, $\mathrm{CBin}(n, \lambda/n, Q) \to \mathrm{CPo}(\lambda, Q)$, as $n \to \infty$, or, equivalently,

$$T_{1/n,Q}(\mathrm{Bin}(n, \lambda)) = T_{1/n,Q}(P^{*n}) \to \mathrm{CPo}(\lambda, Q),$$

where $P$ denotes the $\mathrm{Bern}(\lambda)$ distribution.

As with the strong law of thin numbers, this result remains true for general distributions $P$, and the convergence can be established in the sense of information divergence:

Theorem 38: Let $P$ be a distribution on $\mathbb{N}_0$ with mean $\lambda > 0$ and finite variance $\sigma^2$. Then, for any probability measure $Q$ on $\mathbb{N}$,

$$D(T_{1/n,Q}(P^{*n})\,\|\,\mathrm{CPo}(\lambda, Q)) \to 0, \quad \text{as } n \to \infty,$$

as long as $D(P\,\|\,\mathrm{Po}(\lambda)) < \infty$.

The proof is very similar to that of Theorem 14 and thus omitted. In fact, the same argument as in that proof works for non-integer-valued compounding. That is, if $Q$ is an arbitrary probability measure on $\mathbb{R}^d$, then compound thinning a $\mathbb{N}_0$-valued random variable $Y \sim P$ as in (37) gives a probability measure $T_{\alpha,Q}(P)$ on $\mathbb{R}^d$.

It is somewhat remarkable that the statement and proof of most of our results concerning the information divergence remain essentially unchanged in this case. For example, we easily obtain the following analog of Proposition 19:

Proposition 39: If $P$ is a distribution on $\mathbb{N}_0$ with mean $\lambda/\alpha$ and variance $\sigma^2 < \infty$, for some $\alpha \in (0,1)$, then, for any probability measure $Q$ on $\mathbb{R}^d$,



The details of the argument of the proof of the proposition are straightforward extensions of the corresponding proof of Proposition 19.

Appendix

Proof of Lemma 3: Simply apply Lemma 4 to Definition 1 with $Y \sim P$, to obtain,

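$$
E\Bigl[\Bigl(\sum_{x=1}^{Y} B_x\Bigr)^{\underline{k}}\Bigr]
= E\Bigl[E\Bigl[\Bigl(\sum_{x=1}^{Y} B_x\Bigr)^{\underline{k}} \,\Big|\, Y\Bigr]\Bigr]
= E\Bigl[\sum_{k_x \in \{0,1\},\ \sum_x k_x = k} k! \prod_{x=1}^{Y} E\bigl[B_x^{\underline{k_x}} \,\big|\, Y\bigr]\Bigr]
= \alpha^k E[Y^{\underline{k}}],
$$

using the fact that the sequence of factorial moments of the $\mathrm{Bern}(\alpha)$ distribution are $\{1, \alpha, 0, 0, \ldots\}$. ■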

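Proof of Proposition ??: Assume that $T_{\alpha_0} P = T_{\alpha_0} Q$ for a given $\alpha_0 > 0$. Then, recalling the semigroup property $T_\alpha \circ T_\beta = T_{\alpha\beta}$ of thinning, it follows that $T_\alpha P = T_\alpha Q$ for all $\alpha \in [0, \alpha_0]$. In particular, $T_\alpha P(0) = T_\alpha Q(0)$ for all $\alpha \in [0, \alpha_0]$, i.e.,

$$\sum_{x=0}^{\infty} P(x)(1-\alpha)^x = \sum_{x=0}^{\infty} Q(x)(1-\alpha)^x,$$

for all $\alpha \in [0, \alpha_0]$, which is only possible if $P(x) = Q(x)$ for all $x \ge 0$. ■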
Proof of Proposition ??: Note that the expectation,

$$E\Bigl[X^{\underline{k}}\Bigl(\frac{(X+1)P(X+1)}{P(X)} - \lambda\Bigr)\Bigr] \le 0,$$

by the Chebyshev rearrangement lemma, since it is the covariance between an increasing and a decreasing function. Rearranging this inequality gives,

$$E[X^{\underline{k+1}}] = \sum_{x=0}^{\infty} P(x+1)(x+1)^{\underline{k+1}} \le \lambda \sum_{x=0}^{\infty} P(x)\, x^{\underline{k}} = \lambda E[X^{\underline{k}}],$$

as required. ■

Proof of Proposition ??: To prove part (a), using Lemma 4 we have,



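$$
E[(X+Y)^{\underline{k}}] = E\Bigl[\sum_{\ell=0}^{k} \binom{k}{\ell} X^{\underline{\ell}}\, Y^{\underline{k-\ell}}\Bigr]
= \sum_{\ell=0}^{k} \binom{k}{\ell} E[X^{\underline{\ell}}]\, E[Y^{\underline{k-\ell}}]
\le \sum_{\ell=0}^{k} \binom{k}{\ell} \lambda^{\ell} \mu^{k-\ell}
= (\lambda + \mu)^k.
$$

It is straightforward to check, using Lemma 3, that $T_\alpha P \in PB(\alpha\lambda)$.

To prove part (b), using Lemma 4, Pascal's identity and relabelling, yields,

$$
\begin{aligned}
E[(X+Y)^{\underline{k+1}}]
&= E\Bigl[\sum_{\ell=0}^{k+1} \binom{k+1}{\ell} X^{\underline{\ell}}\, Y^{\underline{k+1-\ell}}\Bigr] \\
&= E\Bigl[\sum_{\ell=0}^{k+1} \Bigl(\binom{k}{\ell-1} + \binom{k}{\ell}\Bigr) X^{\underline{\ell}}\, Y^{\underline{k+1-\ell}}\Bigr] \\
&= \sum_{\ell=0}^{k} \binom{k}{\ell}\Bigl(E[X^{\underline{\ell+1}}]\, E[Y^{\underline{k-\ell}}] + E[X^{\underline{\ell}}]\, E[Y^{\underline{k-\ell+1}}]\Bigr) \\
&\le \sum_{\ell=0}^{k} \binom{k}{\ell}\Bigl(\lambda E[X^{\underline{\ell}}]\, E[Y^{\underline{k-\ell}}] + E[X^{\underline{\ell}}]\, \mu E[Y^{\underline{k-\ell}}]\Bigr) \\
&= (\lambda + \mu)\, E[(X+Y)^{\underline{k}}].
\end{aligned}
$$

The second property is again easily checked using Lemma 3. ■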

Proof of Theorem 14: In order to apply Proposition 15 with $P^{*n}$ in place of $P$ and $\alpha = 1/n$, we need to check that $D(P^{*n}\,\|\,\mathrm{Po}(n\lambda))$ is finite. Let $S_n$ denote the sum of $n$ i.i.d. random variables $X_i \sim P$, so that $P^{*n}$ is the distribution of $S_n$. Similarly, $\mathrm{Po}(n\lambda)$ is the distribution of the sum of $n$ independent $\mathrm{Po}(\lambda)$ variables. Therefore, using the data-processing inequality [8] as in [24] implies that $D(P^{*n}\,\|\,\mathrm{Po}(n\lambda)) \le n D(P\,\|\,\mathrm{Po}(\lambda))$, which is finite by assumption.

Proposition 15 gives,

$$D(T_{1/n}(P^{*n})\,\|\,\mathrm{Po}(\lambda)) \le \frac{1}{2n^2(1 - 1/n)} + E[(S_n/n)\log(S_n/n)] - \lambda\log\lambda.$$

By the law of large numbers, $S_n/n \to \lambda$ a.s., so $(S_n/n)\log(S_n/n) \to \lambda\log\lambda$ a.s., as $n \to \infty$. Therefore, to complete the proof it suffices to show that $(S_n/n)\log(S_n/n)$ converges to $\lambda\log\lambda$ also in $L^1$, or, equivalently, that the sequence $\{T_n = (S_n/n)\log(S_n/n)\}$ is uniformly integrable. We will actually show that the nonnegative random variables $T_n$ are bounded above by a different uniformly integrable sequence. Indeed, by the log-sum inequality,

$$T_n = \frac{S_n}{n}\log\frac{S_n}{n} \le \frac{1}{n}\sum_{i=1}^{n} X_i \log X_i. \qquad (38)$$

Arguing as in the beginning of the proof of Proposition 15 shows that the mean $\mu = E[X_i \log X_i]$ is finite, so the law of large numbers implies that the averages in (38) converge to $\mu$ a.s. and in $L^1$. Hence, they form a uniformly integrable sequence; this implies that the $T_n$ are also uniformly integrable, completing the proof. ■

Proof of Theorem ??: The proof is similar to that of Theorem 14, so some details are omitted. For each $n \ge 1$, let $\lambda^{(n)} = \frac{1}{n}\sum_{i=1}^{n} \lambda_i$ and write $S_n = \sum_{i=1}^{n} X_i$, where the random variables $X_i$ are independent, with each $X_i \sim P_i$.

First, to see that $D(P^{(n)}\,\|\,\mathrm{Po}(n\lambda^{(n)}))$ is finite, applying the data-processing inequality [8] as in [24] gives, $D(P^{(n)}\,\|\,\mathrm{Po}(n\lambda^{(n)})) \le \sum_{i=1}^{n} D(P_i\,\|\,\mathrm{Po}(\lambda_i))$, and it is easy to check that each of these terms is finite because all $P_i$ have finite second moments. As before, Proposition 15 gives,

$$D(T_{1/n}(P^{(n)})\,\|\,\mathrm{Po}(\lambda^{(n)})) \le \frac{1}{2n^2(1 - 1/n)} + E[(S_n/n)\log(S_n/n)] - \lambda^{(n)}\log\lambda^{(n)}. \qquad (39)$$

Letting $Y_i = X_i - \lambda_i$ for each $i$, the independent random variables $Y_i$ have zero mean and,



i=l i=l 1=1 

which is finite by assumption (b). Then, by the general version of the law of large numbers on [11, p. 239], $\frac{1}{n}\sum_{i=1}^{n} Y_i \to 0$ a.s., hence, by assumption (a), $S_n/n \to \lambda$ a.s., so that also, $(S_n/n)\log(S_n/n) \to \lambda\log\lambda$ a.s., as $n \to \infty$. Moreover, since $(x\log x)^{4/3} \le x^2$ for every integer $x \ge 1$, we have,



n n / I \ \ n 



1 " 1 

Z — / ^ — ^ 

i=l l<i7^i<" 

n 



i=l 

which is uniformly bounded over $n$ by our assumptions. Therefore, the sequence $\{(S_n/n)\log(S_n/n)\}$ is bounded in $L^p$ with $p = 4/3 > 1$, which implies that it is uniformly integrable, therefore it converges to $\lambda\log\lambda$ also in $L^1$, so that, $D(T_{1/n}(P^{(n)})\,\|\,\mathrm{Po}(\lambda^{(n)})) \to 0$ as $n \to \infty$.

Finally, recalling once more that the Poisson measures form an exponential family, they satisfy a Pythagorean identity [8], so that

$$D(T_{1/n}(P^{(n)})\,\|\,\mathrm{Po}(\lambda)) = D(T_{1/n}(P^{(n)})\,\|\,\mathrm{Po}(\lambda^{(n)})) + D(\mathrm{Po}(\lambda^{(n)})\,\|\,\mathrm{Po}(\lambda)),$$

where the first term was just shown to go to zero as $n \to \infty$, and the second term is actually equal to,

$$\lambda^{(n)}\log\frac{\lambda^{(n)}}{\lambda} + \lambda - \lambda^{(n)},$$

which also vanishes as $n \to \infty$ by assumption (a). ■

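Proof of Proposition 26: Let $X_\alpha$ and $Z$ denote independent random variables with distributions $T_\alpha P$ and $\mathrm{Po}((1-\alpha)\lambda)$, respectively. Then from the definitions, and using Lemmas 4 and 3,

$$
\begin{aligned}
E[P_k^\lambda(X_{\alpha,\lambda})]
&= \frac{1}{(\lambda^k k!)^{1/2}} \sum_{\ell=0}^{k} (-\lambda)^{k-\ell}\binom{k}{\ell} E[(X_\alpha + Z)^{\underline{\ell}}] \\
&= \frac{1}{(\lambda^k k!)^{1/2}} \sum_{\ell=0}^{k} (-\lambda)^{k-\ell}\binom{k}{\ell} \sum_{m=0}^{\ell} \binom{\ell}{m} \alpha^m E[X^{\underline{m}}]\,((1-\alpha)\lambda)^{\ell-m},
\end{aligned}
$$

where we have used the fact that the factorial moments of a $\mathrm{Po}(t)$ random variable $Z_t$ satisfy, $E[Z_t^{\underline{n}}] = t^n$. Simplifying and interchanging the two sums,

$$
\begin{aligned}
E[P_k^\lambda(X_{\alpha,\lambda})]
&= \frac{1}{(\lambda^k k!)^{1/2}} \sum_{m=0}^{k} \binom{k}{m} \alpha^m E[X^{\underline{m}}]\,(-\lambda)^{k-m} \sum_{\ell=m}^{k} \binom{k-m}{\ell-m} (\alpha-1)^{\ell-m} \\
&= \frac{\alpha^k}{(\lambda^k k!)^{1/2}} \sum_{m=0}^{k} \binom{k}{m} (-\lambda)^{k-m} E[X^{\underline{m}}]
= \alpha^k E[P_k^\lambda(X)],
\end{aligned}
$$

as claimed. ■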



Proof of Proposition 27: First we have to prove that $P/\mathrm{Po}(\lambda) \in L^2$. Assume $P$ is Poisson bounded with ratio $\mu$, say. Using the bound in Lemma 13,



which is finite. 

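Now, recalling the general expansion (27), it suffices to show that $\langle P/\mathrm{Po}(\lambda), P_k^\lambda\rangle = E[P_k^\lambda(X)]$. Indeed, for $Z \sim \mathrm{Po}(\lambda)$,

$$\Bigl\langle \frac{P}{\mathrm{Po}(\lambda)}, P_k^\lambda \Bigr\rangle = E\Bigl[\frac{P(Z)}{\mathrm{Po}(\lambda, Z)} P_k^\lambda(Z)\Bigr] = \sum_{x=0}^{\infty} P(x) P_k^\lambda(x) = E[P_k^\lambda(X)],$$

as required. ■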

Proof of Proposition 30: We need the following simple lemma; for a proof see, e.g., [13].

Lemma 40: If

$$F(m, x) = \sum_{\ell=0}^{m} \binom{x}{\ell} (-1)^\ell,$$

then

$$F(m, x) \ge \delta_x \ \text{ for } m \text{ even}, \qquad F(m, x) \le \delta_x \ \text{ for } m \text{ odd}.$$



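Turning to the proof of Proposition 30, assume $X \sim P$ is Poisson bounded with ratio $\lambda$. Then the series in the statement converges, since

$$\sum_{\ell=0}^{\infty} \Bigl|\frac{(-1)^\ell E[X^{\underline{x+\ell}}]}{x!\,\ell!}\Bigr| \le \sum_{\ell=0}^{\infty} \frac{\lambda^{x+\ell}}{x!\,\ell!} = e^{2\lambda}\,\mathrm{Po}(\lambda, x) < \infty.$$

For $m$ even we have, for every integer $z \ge x$,

$$\sum_{\ell=0}^{m} \binom{z-x}{\ell} (-1)^\ell \ge \delta_{z-x},$$

therefore,

$$\sum_{\ell=0}^{m} (-1)^\ell \frac{z^{\underline{x+\ell}}}{x!\,\ell!} = \binom{z}{x} \sum_{\ell=0}^{m} \binom{z-x}{\ell} (-1)^\ell \ge \delta_{z-x}.$$

Multiplying by $P(z)$ and summing over $z$,

$$\frac{1}{x!} \sum_{\ell=0}^{m} (-1)^\ell \frac{E[X^{\underline{x+\ell}}]}{\ell!} \ge P(x).$$

A similar argument holds for $m$ odd. ■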

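Proof of Theorem 31: Let $X_\alpha$ have distribution $T_\alpha P$. Using Lemma 3, Proposition 30 and the fact that $X$ is ultra bounded, the score function of $X_\alpha$ can be bounded as,

$$
\begin{aligned}
\rho_{X_\alpha}(z) &= \frac{(z+1) T_\alpha P(z+1)}{\alpha\lambda\, T_\alpha P(z)} - 1 \\
&\le \frac{\alpha^{z+1} E[X^{\underline{z+1}}]/z!}{\alpha\lambda\bigl(\alpha^{z} E[X^{\underline{z}}] - \alpha^{z+1} E[X^{\underline{z+1}}]\bigr)/z!} - 1 \\
&= \frac{E[X^{\underline{z+1}}]}{\lambda E[X^{\underline{z}}]}\Bigl(1 - \alpha\frac{E[X^{\underline{z+1}}]}{E[X^{\underline{z}}]}\Bigr)^{-1} - 1 \\
&\le [1 - \lambda\alpha]^{-1} - 1 = \frac{\alpha\lambda}{1 - \alpha\lambda}.
\end{aligned}
$$

Since the lower bound $\rho_{X_\alpha}(z) \ge -1$ is obvious, it follows that,

$$\rho_{X_\alpha}(z)^2 \le 1, \quad \text{for all } \alpha > 0 \text{ small enough.} \qquad (40)$$

We express $K(T_\alpha P)$ in three terms:

$$K(T_\alpha P) = \lambda\alpha \sum_{z=0}^{\kappa-2} T_\alpha P(z)\rho_{X_\alpha}(z)^2 + \lambda\alpha\, T_\alpha P(\kappa-1)\rho_{X_\alpha}(\kappa-1)^2 + \lambda\alpha \sum_{z=\kappa}^{\infty} T_\alpha P(z)\rho_{X_\alpha}(z)^2. \qquad (41)$$

For the third term note that, applying Markov's inequality to the function $f(x) = x(x-1)\cdots(x-\kappa+1)$, which increases on the integers, we obtain,

$$\Pr\{X_\alpha \ge \kappa\} \le \frac{E[X_\alpha^{\underline{\kappa}}]}{\kappa!} = \frac{\alpha^\kappa E[X^{\underline{\kappa}}]}{\kappa!} \le \frac{(\alpha\lambda)^\kappa}{\kappa!}.$$

Therefore, using this and (40), for small enough $\alpha > 0$ the third term in (41) is bounded above by,

$$\alpha\lambda \Pr\{X_\alpha \ge \kappa\},$$

which, divided by $\alpha^\kappa$, tends to zero as $\alpha \to 0$.

For the other two terms we use the full expansion of Proposition 30, together with Lemma 3, to obtain a more accurate expression for the score function,

$$
\rho_{X_\alpha}(z) = \frac{(z+1) T_\alpha P(z+1) - \alpha\lambda\, T_\alpha P(z)}{\alpha\lambda\, T_\alpha P(z)}
= \frac{\sum_{\ell=0}^{\infty} (-1)^\ell \alpha^{z+\ell+1}\bigl(E[X^{\underline{z+\ell+1}}] - \lambda E[X^{\underline{z+\ell}}]\bigr)/\ell!}{\alpha\lambda \sum_{\ell=0}^{\infty} (-1)^\ell \alpha^{z+\ell} E[X^{\underline{z+\ell}}]/\ell!}.
$$

Since, by assumption, $E[X^{\underline{z+\ell+1}}] - \lambda E[X^{\underline{z+\ell}}] = 0$ for $z + \ell < \kappa - 1$, the first terms in the series in the numerator above vanish. Therefore,

$$
\rho_{X_\alpha}(z) = \alpha^{\kappa-z-1}\, \frac{\sum_{\ell=0}^{\infty} (-1)^{\ell+\kappa-z-1} \alpha^{\ell}\bigl(E[X^{\underline{\kappa+\ell}}] - \lambda E[X^{\underline{\kappa+\ell-1}}]\bigr)/(\ell+\kappa-z-1)!}{\lambda \sum_{\ell=0}^{\infty} (-1)^\ell \alpha^{\ell} E[X^{\underline{z+\ell}}]/\ell!}.
$$

For $z \le \kappa - 2$, the numerator and denominator above are both bounded functions of $\alpha$, and the denominator is bounded away from zero (because of the term corresponding to $\ell = 0$). Therefore, for each $0 \le z \le \kappa - 2$, the score function $\rho_{X_\alpha}(z)$ is of order $\alpha^{\kappa-z-1}$. For the first term in (41) we thus have,

$$\lambda\alpha \sum_{z=0}^{\kappa-2} T_\alpha P(z)\rho_{X_\alpha}(z)^2 = \sum_{z=0}^{\kappa-2} O\bigl(\alpha^{2\kappa-z-1}\bigr),$$

which, again, when divided by $\alpha^\kappa$, tends to zero as $\alpha \to 0$.

Thus only the second term in (41) contributes. For this term, we similarly obtain,

$$\lim_{\alpha \to 0} \rho_{X_\alpha}(\kappa-1) = \frac{E[X^{\underline{\kappa}}] - \lambda E[X^{\underline{\kappa-1}}]}{\lambda E[X^{\underline{\kappa-1}}]} = \frac{E[X^{\underline{\kappa}}] - \lambda^\kappa}{\lambda^\kappa},$$

and,

$$\lim_{\alpha \to 0} \frac{T_\alpha P(\kappa-1)}{\alpha^{\kappa-1}} = \lim_{\alpha \to 0} \frac{E[X_\alpha^{\underline{\kappa-1}}]}{(\kappa-1)!\,\alpha^{\kappa-1}} = \frac{\lambda^{\kappa-1}}{(\kappa-1)!}.$$

Finally, combining the above limits with (41) yields,

$$
\lim_{\alpha \to 0} \frac{K(T_\alpha X)}{\alpha^\kappa} = \frac{\lambda^\kappa}{(\kappa-1)!}\Bigl(\frac{E[X^{\underline{\kappa}}] - \lambda^\kappa}{\lambda^\kappa}\Bigr)^2
= \kappa\Bigl(\frac{E[X^{\underline{\kappa}}] - \lambda^\kappa}{(\lambda^\kappa \kappa!)^{1/2}}\Bigr)^2
= \kappa E[P_\kappa^\lambda(X)]^2, \qquad (43)
$$

as claimed. ■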

Proof of Lemma 35: The key is to observe that for $Y$ ULC, since $P(y+1)(y+1)/P(y)$ is decreasing in $y$, and $y$ is increasing in $y$, there exists an integer $y_0$ such that $P(y+1)(y+1) \le y_0 P(y)$ for $y \ge y_0$ and $P(y+1)(y+1) \ge y_0 P(y)$ for $y < y_0$. Hence:

$$
\begin{aligned}
\sum_{y=z+1}^{\infty} P(y)(y - y_0) &= P(z+1)(z+1) + \sum_{y=z+1}^{\infty} \bigl(P(y+1)(y+1) - y_0 P(y)\bigr) \\
&\le (z+1)P(z+1), \quad \text{for } z \ge y_0; \\
\sum_{y=0}^{z} P(y)(y_0 - y) &= P(z+1)(z+1) - \sum_{y=0}^{z} \bigl(P(y+1)(y+1) - y_0 P(y)\bigr) \\
&\le (z+1)P(z+1), \quad \text{for } z \le y_0 - 1.
\end{aligned}
$$

Further, by Cauchy-Schwarz, for $y \ge y_0$,

$$(g(y) - g(y_0))^2 = \Bigl(\sum_{z=y_0}^{y-1} \Delta g(z)\Bigr)^2 \le (y - y_0)\Bigl(\sum_{z=y_0}^{y-1} \Delta g(z)^2\Bigr), \qquad (44)$$

while for $y \le y_0 - 1$,

$$(g(y) - g(y_0))^2 = \Bigl(\sum_{z=y}^{y_0-1} \Delta g(z)\Bigr)^2 \le (y_0 - y)\Bigl(\sum_{z=y}^{y_0-1} \Delta g(z)^2\Bigr). \qquad (45)$$

This means that (with the reversal of order of summation justified by Fubini, since all the terms have the same sign),

$$
\begin{aligned}
\sum_{y=0}^{\infty} P(y)(g(y) - \mu)^2 &\le \sum_{y=0}^{\infty} P(y)(g(y) - g(y_0))^2 \\
&= \sum_{y=0}^{y_0-1} P(y)(g(y) - g(y_0))^2 + \sum_{y=y_0}^{\infty} P(y)(g(y) - g(y_0))^2 \\
&\le \sum_{y=0}^{y_0-1} P(y)(y_0 - y)\Bigl(\sum_{z=y}^{y_0-1} \Delta g(z)^2\Bigr) + \sum_{y=y_0}^{\infty} P(y)(y - y_0)\Bigl(\sum_{z=y_0}^{y-1} \Delta g(z)^2\Bigr) \qquad (46) \\
&= \sum_{z=0}^{y_0-1} \Delta g(z)^2 \Bigl(\sum_{y=0}^{z} P(y)(y_0 - y)\Bigr) + \sum_{z=y_0}^{\infty} \Delta g(z)^2 \Bigl(\sum_{y=z+1}^{\infty} P(y)(y - y_0)\Bigr) \\
&\le \sum_{z=0}^{\infty} \Delta g(z)^2\, P(z+1)(z+1), \qquad (47)
\end{aligned}
$$

and the result holds. Note that the inequality in (46) follows by (44) and (45), and the inequality in (47) by the discussion above. ■